This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on https://www.researchprotocols.org, as well as this copyright and license information must be included.
Multiple long-term health conditions (multimorbidity) (MLTC-M) are increasingly prevalent and associated with high rates of morbidity, mortality, and health care expenditure. Strategies to address this have primarily focused on the biological aspects of disease, but MLTC-M also result from and are associated with additional psychosocial, economic, and environmental barriers. A shift toward more personalized, holistic, and integrated care could be effective. This could be made more efficient by identifying groups of populations based on their health and social needs. In turn, these will contribute to evidence-based solutions supporting delivery of interventions tailored to address the needs pertinent to each cluster. Evidence is needed on how to generate clusters based on health and social needs and quantify the impact of clusters on long-term health and costs.
We intend to develop and validate population clusters that consider determinants of health and social care needs for people with MLTC-M using data-driven machine learning (ML) methods compared to expert-driven approaches within primary care national databases, followed by evaluation of cluster trajectories and their association with health outcomes and costs.
The mixed methods program of work, with parallel work streams, includes the following: (1) qualitative semistructured interview studies exploring patient, caregiver, and professional views on clinical and socioeconomic factors influencing experiences of living with or seeking care in MLTC-M; (2) a modified Delphi with relevant stakeholders to generate variables on health and social (wider) determinants and to examine the feasibility of including these variables within existing primary care databases; and (3) a cohort study with expert-driven segmentation, alongside data-driven algorithms. Outputs will be compared, clusters characterized, and trajectories over time examined to quantify associations with mortality, additional long-term conditions, worsening frailty, disease severity, and 10-year health and social care costs.
The study will commence in October 2021 and is expected to be completed by October 2023.
By studying MLTC-M clusters, we will assess how more personalized care can be developed, how accurate costs can be provided, and how to better understand the personal and medical profiles and environment of individuals within each cluster. Integrated care that considers “whole persons” and their environment is essential in addressing the complex, diverse, and individual needs of people living with MLTC-M.
PRR1-10.2196/34405
Multiple long-term health conditions (multimorbidity) (MLTC-M) have been defined in the 2018 Academy of Medical Sciences policy report [
These impacts emphasize the need for a deeper understanding of MLTC-M in relation to physical health, mental health, and social well-being. Integrated care may have the potential to address MLTC-M more effectively, although current evidence offers a mixed picture of the efficacy of integration in addressing the complex care needs of this cohort of patients. Previous MLTC-M research in the United Kingdom shows that integrated services in MLTC-M contribute to higher patient satisfaction [
Data sets comprising millions of patient records including measures of health and socioeconomic determinants alongside subsequent health and social needs over the life course of a patient with MLTC-M are increasingly available. This provides opportunities to advance the understanding of MLTC-M toward the delivery of truly person-centered and holistic care. At present, efforts to improve care focus on approaches that primarily address biological needs, rather than considering the impact of wider health and social determinants on individuals living with several conditions at the same time [
Operationalizing holistic and integrated care is challenging due to the level of personalization required across the health and social care continuum. At an individual level, it is costly and difficult to implement. Clustering heterogeneous populations into relatively homogenous subgroups with similar health and socioeconomic determinants and needs and then tailoring appropriate interventions to each cluster could offer a pragmatic solution. Studies have demonstrated the potential of clustering for integrating health and social care using expert-driven segmentation [
Data-driven approaches include unsupervised artificial intelligence (AI) algorithms, including metric learning or variational autoencoder frameworks. The feature selection and engineering process will initially be informed by expert- and patient-proposed variables, but deep ML can extend these using self-learning. Clusters generated by deep artificial neural networks (ANNs) are more likely to be homogenous and predict trajectories. For example, a study of 2449 participants in Taiwan combined medical and socioeconomic data to generate data-driven clusters that accurately predicted service usage and expenditure [
Advances in data-driven processing paradigms could overcome previous limitations in methodology using unsupervised or semisupervised deep embedded clustering [
We aim to develop and validate population clusters that consider health and social care determinants and subsequent health and care needs for people with MLTC-M using data-driven AI methods compared to expert-driven approaches, followed by evaluation of cluster trajectories and their associations with health outcomes and costs.
We will carry out a longitudinal mixed methods study with 3 parallel work streams: (1) a qualitative interview study, (2) a modified Delphi study, and (3) a cohort study. Each is described below.
Email and postal invitations will be sent to participants who have expressed interest through advertisements viewed on social media, local community centers, the university website, charity newsletters, caregiver support networks, and through word of mouth. Given the complex structure of health and social care, an iterative and a proactive recruitment approach will be necessary. To include hard-to-reach and underrepresented groups that reflect diversity in social needs, we will recruit at events, such as those held in local authority facilities and community or faith centers, as well as seek additional expert input through established Black and minority ethnic networks. We will aim for a representative sample of <30 interviews, as our pilot interview study [
Semistructured interviews will explore views on health and social needs over the course of living with or supporting MLTC-M and views on possible intervention components identified from our preliminary work. Telephone or internet-based video interviews will be conducted by trained researchers. An interview schedule will be designed covering broad open questions to enable similar topics to be addressed across the sample. The design of the interview schedule will be informed by our study aims, our previously published scoping review, and the expertise of team members; it will then be tested prior to use. Furthermore, the development of the interview schedule will be iterative; insights from earlier interviews may inform additions or amendments to the interview schedule in the later interviews. A flexible approach will be used to ensure that related subjects of importance can be raised. Interviews will be digitally recorded and transcribed verbatim and the content anonymized.
We will use inductive reflexive thematic analysis [
Integrated care with co-ordination, continuous health, and social input is set out by the SELFIE (Sustainable intEgrated care models for multi-morbidity: delivery, FInancing and performancE) framework [
A “virtual Delphi panel” will be established. Participants will be invited to join, including experts from health and social care, service managers, researchers, caregivers, patients, and database managers. We will convene a panel of >20 members. A purposive sampling approach will be used to recruit the panel.
A modified Delphi technique [
Discussion among panelists related to the potential clustering of specific variables will be guided and structured by the SELFIE conceptual framework. In particular, the extent to which variables are applicable within existing databases or obtainable through new health and social care data linkages will be considered in detail by the panel. Initially, participants will be supplied with a ranked list of variables generated from our preliminary review, followed by discussion and the qualitative study described above. Then, similar ideas emerging from these discussions will be grouped. Potentially relevant variables will be collated by the research team, fed back, and subsequently rated or ranked by panelists at the next round (phase II), with a “free text” option available for clarification. The most highly rated variables will be taken forward (phase III).
The panelists in round 1 will make their initial judgments individually without any interaction with other panelists, and these “ratings” will be fed into subsequent rounds. Web-based interactions with other panelists will occur during the deliberation rounds of the Delphi panel, a process spanning 1 to 2 days. At each stage, researchers experienced in the Delphi method will moderate the panel. The research team will take notes of these discussions to track the decision-making process and determine how and why specific decisions are reached. No attempt will be made by researchers to hasten discussion or compel the panel to reach a consensus.
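As a rough illustration of how ratings from one round could be summarized before being fed back to the panel, the sketch below computes the median rating and the share of panelists who rated each variable highly. The 1 to 9 rating scale, the cut-off of 7, the 70% agreement threshold, and the variable names are assumptions made for this example; the protocol does not specify them.

```python
from statistics import median


def summarize_ratings(ratings, high_cutoff=7, agreement=0.7):
    """Summarize one Delphi round.

    ratings: dict mapping a variable name to the list of panelist ratings
             (assumed here to be on a 1-9 scale).
    Returns rows of (variable, median, share_high, carried_forward),
    sorted by median rating, highest first.
    """
    summary = []
    for variable, scores in ratings.items():
        # Share of panelists whose rating reaches the (assumed) cut-off.
        share_high = sum(s >= high_cutoff for s in scores) / len(scores)
        summary.append(
            (variable, median(scores), share_high, share_high >= agreement)
        )
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Hypothetical ratings from five panelists for three candidate variables.
round_two = {
    "housing_status": [8, 9, 7, 8, 6],
    "caregiver_support": [7, 7, 8, 9, 9],
    "pet_ownership": [3, 4, 2, 5, 6],
}
delphi_summary = summarize_ratings(round_two)
```

A summary of this kind would only support the moderated discussion; it is not meant to replace the panel's deliberation.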
We will use the Clinical Practice Research Datalink (CPRD) GOLD and Aurum [
CPRD GOLD and Aurum include 50 million registered GP patients with high levels of heterogeneity in ethnicity, deprivation, and morbidities. Primary care–linked records include Hospital Episode Statistics Admitted Patient Care (HES APC) data on hospital admissions, discharges, accident and emergency (A and E), and outpatients in England, socioeconomic status (Index of Multiple Deprivation [IMD] or Townsend score), and death data from the Office for National Statistics.
SAIL is a nationwide repository of routinely collected electronic data on health and social care in Wales, United Kingdom. It includes over 2 billion anonymized records linked with hospital admissions and primary care data [
The ELSA collects data from people aged over 50 years covering physical and mental health, well-being, finances, and attitudes around aging and how these change over time. The Health Survey for England is an annual survey that looks at changes in the health and lifestyle of people. Local area data sets allow local authority data linkage with health information using health determinants from the census and social determinants, such as wealth and the IMD (a score calculated for each participant's neighborhood based on social indices such as income, education, and employment).
The same variable definitions will be applied to all data sets, wherever possible, to ensure consistency and comparability of findings from the respective data sets.
Participants must be aged 18 years and over when entering the study. They must be diagnosed with MLTC-M (defined by Guthrie et al, forthcoming), which includes the following 59 conditions: stroke, coronary heart disease, heart failure, peripheral arterial disease, heart valve disorder, arrhythmia, venous thromboembolic disease, aneurysm, hypertension, diabetes, Addison disease, cystic fibrosis, thyroid disease, chronic obstructive pulmonary disease, asthma, bronchiectasis, Parkinson disease, epilepsy, multiple sclerosis, paralysis, transient ischemic attack, peripheral neuropathy, chronic primary pain, solid organ cancer, hematological cancer, metastatic cancer, melanoma, benign cerebral tumors that can cause disability, dementia, schizophrenia, depression, anxiety, bipolar disorder, drug or alcohol misuse, eating disorder, autism, posttraumatic stress disorder, connective tissue disease, osteoarthritis, osteoporosis, gout, long-term musculoskeletal problems due to injury, chronic liver disease, inflammatory bowel disease, chronic pancreatic disease, peptic ulcer, chronic kidney disease, end-stage kidney disease, endometriosis, chronic urinary tract infection, anemia (including pernicious anemia, sickle cell anemia), visual impairment that cannot be corrected, hearing impairment that cannot be corrected, Meniere disease, HIV/AIDS, chronic Lyme disease, tuberculosis, postacute COVID-19, congenital disease, and chromosomal abnormality.
The participants within the data sets will be followed up until the earliest occurrence of the following: development of the outcomes of interest, transfer out of the practice, death, the practice ceasing to contribute data to the database, or the end of data linkage to HES APC and the Office for National Statistics (ONS).
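The follow-up rule above amounts to taking the minimum over the available end dates. A minimal sketch is given below; the function name, the argument names, and the example dates are hypothetical and do not correspond to actual field names in any of the databases.

```python
from datetime import date


def follow_up_end(outcome=None, transfer_out=None, death=None,
                  practice_last_collection=None, linkage_end=None):
    """Return (end_date, reason) for the earliest non-missing event.

    Each argument is a datetime.date or None when the event did not occur
    or is not recorded. Argument names are illustrative only.
    """
    events = {
        "outcome": outcome,
        "transfer_out": transfer_out,
        "death": death,
        "practice_last_collection": practice_last_collection,
        "linkage_end": linkage_end,
    }
    observed = {name: day for name, day in events.items() if day is not None}
    if not observed:
        raise ValueError("at least one end-of-follow-up date is required")
    # The event with the earliest date ends follow-up.
    reason = min(observed, key=observed.get)
    return observed[reason], reason


# Hypothetical participant who left the practice before linkage ended.
end_date, reason = follow_up_end(
    transfer_out=date(2019, 6, 30),
    practice_last_collection=date(2021, 3, 1),
    linkage_end=date(2020, 12, 31),
)
```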
Variables for inclusion in the clustering models will be generated by the qualitative study and Delphi, including but not limited to sociodemographic variables (eg, age, sex, ethnicity, and IMD), clinical variables (eg, blood pressure, cholesterol, medication use defined as repeat medication only), social needs (eg, social services, physiotherapy, or occupational health input), health and social care usage (eg, hospitalization, outpatient appointments), mortality and health care costs (eg, inpatient costs of admissions to hospital as a day case [
The initial core variables that we have included are as follows: all-cause and cause-specific mortality
Participant characteristics will be summarized with appropriate summary statistics. Using generalized linear models, we will describe multimorbidity rates and frailty over time. Survival models will be used to investigate all-cause and cause-specific mortality
A variety of data mining and ML methods will be used for data-driven knowledge elicitation regarding the concept of social care needs, patient social care need trajectories over time, outcomes associated with the trajectories, and interventions (modifiable exposures) that can be used to modify the trajectories and their respective final outcomes. This information will be included in a report on intervention strategies and policies.
The sequence of ML-based analytic tasks will be as follows:
The variables defining the concept of social care needs, which are generated by the modified Delphi study, will be entered for clustering and cluster interpretation to discover naturally occurring classes of social care needs. The clustering approach will exploit the ability of the ML methods to process high-dimensional and high-volume data using data science pipelines composed of dimensionality reduction, unsupervised clustering, supervised learning, and model interpretation algorithms. These pipelines are further described below.
Using longitudinal data, patients’ social care need trajectories (ie, sequences of social care need class membership) will be composed, and these trajectories will be clustered with respect to outcomes of interest, including mortality, worsening frailty, accrual of critical long-term conditions of interest, and costs. For trajectory clustering, we will apply hierarchical clustering with custom distance measures based on the outcomes of interest using the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm in Python.
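To make the notion of a trajectory concrete, the following sketch composes per-patient sequences of need-class membership from longitudinal records and builds a pairwise distance matrix. The normalized Hamming distance is used purely as a stand-in, since the custom outcome-based distance measures are not specified in this protocol; the record layout, class labels, and patient identifiers are likewise assumptions.

```python
from collections import defaultdict


def build_trajectories(records):
    """Compose trajectories from (patient_id, year, need_class) records.

    Returns {patient_id: tuple of need classes ordered by year}.
    """
    by_patient = defaultdict(list)
    for patient_id, year, need_class in records:
        by_patient[patient_id].append((year, need_class))
    return {
        pid: tuple(need_class for _, need_class in sorted(observations))
        for pid, observations in by_patient.items()
    }


def hamming_distance(a, b):
    """Share of time points at which two equal-length trajectories differ."""
    if len(a) != len(b):
        raise ValueError("trajectories must cover the same time points")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(trajectories):
    """Return (sorted patient ids, square matrix of pairwise distances)."""
    ids = sorted(trajectories)
    matrix = [
        [hamming_distance(trajectories[i], trajectories[j]) for j in ids]
        for i in ids
    ]
    return ids, matrix


# Hypothetical records; note that p3 arrives out of chronological order.
records = [
    ("p1", 2015, "low"), ("p1", 2016, "low"), ("p1", 2017, "high"),
    ("p2", 2015, "low"), ("p2", 2016, "high"), ("p2", 2017, "high"),
    ("p3", 2017, "high"), ("p3", 2015, "low"), ("p3", 2016, "low"),
]
trajectories = build_trajectories(records)
ids, matrix = distance_matrix(trajectories)
```

A matrix of this form could then be handed to an HDBSCAN implementation that accepts precomputed distances.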
Trajectory clusters will be analyzed by experts to identify clusters of interest (ie, clusters with undesirable outcomes). Intervention points will then be identified at which trajectories could be modified to improve their outcomes.
Predictive and causal associations between exposures (these variables will be selected during the modified Delphi study) and trajectories will be modeled at the points for interventions previously selected. For predictive modeling, we will use the XGBoost algorithm in Python. For causal modeling, we will use directed acyclic graphs and linear models.
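For the causal modeling step, one simple way to encode an assumed directed acyclic graph and verify that it is in fact acyclic is sketched below using Python's standard `graphlib` module (Python 3.9 or later). The variables and edges are hypothetical placeholders; the actual exposures will be selected during the modified Delphi study.

```python
from graphlib import CycleError, TopologicalSorter


def causal_order(parents):
    """Order variables so that every cause precedes its effects.

    parents: dict mapping each variable to the set of its direct causes.
    Raises ValueError if the graph contains a cycle and is therefore
    not a valid directed acyclic graph.
    """
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as err:
        raise ValueError(f"graph is not acyclic: {err.args[1]}") from err


# Hypothetical causal assumptions, for illustration only.
assumed_dag = {
    "age": set(),
    "deprivation": set(),
    "frailty": {"age", "deprivation"},
    "hospitalization": {"frailty", "deprivation"},
}
order = causal_order(assumed_dag)
```

An ordering of this kind can help decide which covariates precede an exposure when specifying the linear models.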
For the ML-based clustering in step 1 of the above procedure, we will develop, apply, and evaluate 2 ML clustering pipelines. Pipeline 1 is semiautomated and based on shallow ML clustering using expert-crafted features calculated from raw data. Here, the expert is aided by ML tools for data visualization by low-dimensional embedding. Pipeline 2 is a fully automated pipeline based on deep ANNs and explainable AI techniques, where the pipeline takes raw data as the input and provides clusters and interpretations as the output. In addition, prior to feeding data into the clustering pipelines, data will be preprocessed by rescaling numerical data (such as standardization or min-max scaling), transforming mixed categorical and numerical data (such as Gower transformation), or calculating dissimilarity measures for categorical data (such as the simple matching coefficient). The precise data preprocessing method will be finalized after a descriptive analysis of the selected variables.
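As an illustration of preprocessing for mixed data, the sketch below computes a Gower-style distance: numerical features contribute their range-normalized absolute difference and categorical features contribute a simple mismatch indicator, averaged with equal weights. The feature names and patient values are invented for the example, and equal feature weighting is an assumption.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style distance between two records with mixed feature types.

    a, b: dicts mapping feature name -> value (same keys in both).
    numeric_ranges: dict mapping each numerical feature to its (min, max)
        in the data; features not listed are treated as categorical.
    """
    total = 0.0
    for feature in a:
        if feature in numeric_ranges:
            low, high = numeric_ranges[feature]
            span = high - low
            # Range-normalized absolute difference, in [0, 1].
            total += abs(a[feature] - b[feature]) / span if span else 0.0
        else:
            # Simple matching: 0 when categories agree, 1 otherwise.
            total += 0.0 if a[feature] == b[feature] else 1.0
    return total / len(a)


# Hypothetical patients with two numerical and two categorical features.
patients = [
    {"age": 40, "systolic_bp": 120, "sex": "F", "imd_quintile": "1"},
    {"age": 80, "systolic_bp": 160, "sex": "F", "imd_quintile": "5"},
    {"age": 60, "systolic_bp": 120, "sex": "M", "imd_quintile": "1"},
]
numeric_features = ["age", "systolic_bp"]
ranges = {
    f: (min(p[f] for p in patients), max(p[f] for p in patients))
    for f in numeric_features
}
d01 = gower_distance(patients[0], patients[1], ranges)
d02 = gower_distance(patients[0], patients[2], ranges)
```

When every feature is categorical, this reduces to the proportion of mismatching features, that is, the dissimilarity form of the simple matching coefficient.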
Pipeline 1 uses semiautomated ML motivated mainly by the work of Becht et al [
In phase 1, we interactively assess the propensity of the data for clustering and topology using low-dimensional (2D and 3D) embeddings with parameterized t-SNE and UMAP [
The data in the low-dimensional (2D/3D) output space of t-SNE and UMAP will be clustered using HDBSCAN; the quality of these clusters will be quantitatively evaluated using measures for cohesion and separation (eg, sum of squared errors, silhouette coefficient, and the Calinski-Harabasz and Davies-Bouldin indexes). The observed natural clusters in the low-dimensional (2D/3D) space are not explicitly and directly interpretable, as UMAP and t-SNE perform highly nonlinear transformations. Interpretable ML methods will therefore be used in the next phase to facilitate their interpretation.
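Of the cohesion and separation measures named above, the silhouette coefficient is perhaps the simplest to write out. The sketch below is a plain-Python version using Euclidean distance on a toy 2D embedding. It assumes that any points labeled as noise by HDBSCAN have been removed beforehand, and it scores points in singleton clusters as 0, which is a common convention rather than something the protocol prescribes.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Cohesion: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy embedding with two well-separated groups.
embedding = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(embedding, [0, 0, 1, 1])
poor = silhouette_score(embedding, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, whereas values near or below 0 suggest that points sit closer to another cluster than to their own.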
Phase 2 selects data features (<20) and interpretable ML classification algorithms to fit models onto the natural clusters in the data. Algorithms with interpretable models are, for example, decision trees, rule learners, naive Bayes, k-nearest neighbors, generalized linear models, Gaussian mixture models, or ensembles of these, each generally performing differently depending on the density, shape, and separation boundaries of the data classes. We will output interpretations of individual clusters based on the derived decision boundaries of the best performing classification model learned for the clusters.
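The idea of fitting an interpretable classifier onto cluster labels can be illustrated with the simplest possible decision tree, a single-split "stump" chosen by weighted Gini impurity. This is a deliberately minimal sketch; the feature names, data, and cluster labels are invented, and a real analysis would presumably use deeper trees or the other model families listed above.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def fit_stump(rows, labels, feature_names):
    """Find the single split 'feature <= threshold' with lowest impurity.

    rows: list of tuples of numerical feature values.
    labels: cluster label for each row.
    Returns (feature_name, threshold, weighted_impurity), or None when no
    feature has at least two distinct values.
    """
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds lie midway between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [l for row, l in zip(rows, labels) if row[j] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[j] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (name, threshold, impurity)
    return best


# Hypothetical patients (age, number of conditions) with cluster labels.
feature_names = ["age", "n_conditions"]
rows = [(45, 2), (50, 3), (52, 2), (78, 3), (81, 2), (85, 3)]
cluster_labels = [0, 0, 0, 1, 1, 1]
stump = fit_stump(rows, cluster_labels, feature_names)
```

Here the learned rule reads directly as a cluster description (in this toy case, a split on age), which is the kind of decision boundary the interpretation step relies on.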
Pipeline 2 is fully automated and builds on the approach by Ding et al [
In phase 1, automated dimensionality reduction will be performed with a deep learning autoencoder neural network, followed by clustering in the lower-dimension space, based on the work of Xie et al [
The autoencoder is given the raw data (even if it is very high-dimensional) and features are successively generated automatically by the layers of the autoencoder, as well as low-dimensional embedding for clustering. Importantly, the mapping of input features (the high-dimensional space) and output features (the low-dimensional space) is captured by the weights of the encoding and decoding neural networks of the autoencoder; we can make transformations between the 2 spaces, which are needed for deriving cluster interpretations using explainable AI techniques in phase 2. Further, in the low-dimensional space, automated clustering will be performed using HDBSCAN, where metrics for cohesion and separation (as in pipeline 1) will be used to select the best clustering.
In phase 2, explainable AI (XAI) algorithms will be used for deriving cluster interpretations. We will use the XGBoost algorithm coupled with the SHAP (SHapley Additive exPlanations) XAI algorithm [
Beyond the approach and methods described above, and in the course of ongoing data analysis work, further refinement of our approach will be considered when addressing the specific concerns explained below.
Statistical dependences between the studied variables will be inferred from the data, which will then be appraised by a multidisciplinary team for the type of relation (causal or not) and factored in dynamic epidemiological models. Interpretable ML will be used for determining dependences between variables, where the techniques will include Bayesian network inference from data, SHAP estimation, and local surrogate model inference (local interpretable model-agnostic explanations [LIME]) over learned, and possibly nonlinear, models. Dictionary learning of sparse coding techniques will also be researched for selecting candidate causalities.
Data sparsity can naturally occur in high-dimensional spaces. However, some or many dimensions will not generally be mutually independent and there will be redundant dimensions to some degree with respect to a given learning task. This type of data sparsity will be addressed using dimensionality reduction techniques (eg, principal component analysis, UMAP) in the ML pipeline. Sparse data will also be interpreted by topological data analysis (eg, with the KeplerMapper tool) and manifold learning. For cases with sparse labeled data, but denser unlabeled data, semisupervised learning and the transfer learning pipeline will be specifically explored and designed for specific characteristics of the problem.
Time series and sequence learning for predicting the sequence of accrued conditions and cluster trajectory prediction will be performed with the long short-term memory (LSTM) and transformer ANN algorithms.
We assume that some known relations between the studied variables exist and are machine readable. These relations will be used to create a graph representation of the related variables. The graph will be used with graph neural networks for learning better latent representations for the downstream learning tasks or for inferring unobserved relations.
Dimensionality reduction, visualizations, clustering, and topological data analysis will be used for finding naturally occurring structures in the data that can be related to the studied phenomenon. We will initially apply readily interpretable ML techniques such as linear models, decision trees, and inferred Bayesian networks. Furthermore, nonlinear model learning on the labeled data (supervised learning with XGBoost or deep ANN) and interpretation learning (with SHAP and LIME) will be performed for identifying predictor variables that will be further assessed for their epidemiological meaning by a multidisciplinary team.
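To indicate what a SHAP-style attribution measures, the sketch below computes exact Shapley values for a single prediction by enumerating all feature coalitions. It is not the SHAP library, which uses approximations and its own handling of background data; here, features outside a coalition are simply set to a fixed baseline value, which is an assumption of this example. The toy risk model, its coefficients, and the feature values are also hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance by full enumeration.

    predict: function taking a list of feature values.
    instance: feature values to explain.
    baseline: reference values used for features outside a coalition.
    Cost grows exponentially with the number of features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions


def risk_score(x):
    """Toy nonlinear model with an interaction term (illustrative only)."""
    age, n_conditions, deprived = x
    return 0.02 * age + 0.5 * n_conditions + 1.0 * deprived * n_conditions


instance = [70, 3, 1]
baseline = [50, 1, 0]
phi = shapley_values(risk_score, instance, baseline)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that makes such attributions convenient for ranking predictor variables.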
Addressing bias is critical to ensure model fairness and to ensure that predictions are not affected by an individual belonging to one of the groups defined by some sensitive attribute(s). An interdisciplinary approach is necessary to ensure that all researchers adopt principles of fairness and responsible AI practices [
Data set selection, wrangling, and transformation have the potential to remove or inadvertently introduce new bias in data. We will address this challenge by building on our team’s existing work in this area including previous algorithms [
Derived clusters and interpretations from the expert-driven and AI-supported clustering will be summarized with descriptive statistics. This will also allow inequities in care across clusters to be assessed using a proxy of area-level deprivation for socioeconomic status. Clusters will also be analyzed to identify whether they differ statistically between the 2 methods. Appropriate regression models (depending on data distribution) will be constructed to quantify the association between population clusters and outcomes. Finally, we will compare outcomes between the derived population clusters with respect to key covariates and between the clustering methods.
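One descriptive way to quantify how closely the expert-driven and data-driven cluster assignments agree is the adjusted Rand index, sketched below in plain Python. This is offered as an example of an agreement measure only; it is not a formal statistical test, and the protocol does not state which measure will be used. The labels are invented.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same people.

    Returns 1.0 for identical partitions (up to relabeling) and values
    near 0 for agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same individuals")
    n = len(labels_a)
    # Pairs of individuals placed together under both, or each, labeling.
    pairs_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, eg, a single cluster in both
    return (pairs_both - expected) / (maximum - expected)


# Hypothetical assignments for six individuals.
expert = ["A", "A", "A", "B", "B", "B"]
data_driven_same = [1, 1, 1, 0, 0, 0]
data_driven_mixed = [1, 0, 1, 0, 1, 0]
ari_same = adjusted_rand_index(expert, data_driven_same)
ari_mixed = adjusted_rand_index(expert, data_driven_mixed)
```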
We will conduct segmentation via latent class analysis using Latent GOLD (Statistical Innovations) and data-driven analysis will be performed with R (version 4 or later) and Python (version 3 or later). The Delphi panel will be reconvened for algorithmic stewardship that permits an additional layer of quality control to discuss AI outputs in terms of safety, fairness, effectiveness, and practicality and to determine which clusters to take forward [
We will ensure FAIR (Findable, Accessible, Interoperable, and Reusable) stewardship of data, curation of models, and research integrity through robust governance, as the pipeline develops processes building on the blueprint outlined for the Social Data Foundation, which has been written by one of the authors of this protocol [
Findable data will be indexed and annotated with semantically rich metadata using shared terminologies. Data will be made accessible by publishing them to national data services via application programming interfaces and directly by humans or machines for integration into workflows, whereas interoperable data will be aligned with standards such as Health Level Seven and Fast Healthcare Interoperability Resources. Reusable data require clear licensing including any ethical, legal, and security requirements necessary for usage.
Ethical approval was granted from the University of Southampton Faculty of Medicine Research Committee (reference 67953).
This study is due to commence in October 2021 and we will aim to complete it by October 2023. The study received funding from the National Institute for Health Research.
Our research attempts to offer commissioners and policy makers reliable evidence on a new approach to manage MLTC-M. We will examine the potential of using ML methods to deliver insights into new disease clusters that consider health and social needs. Clusters will be profiled to evaluate differences in sociodemographic, clinical and treatment variables, comorbid disease patterns, and trajectories of disease progression. We will compare associations of disease trajectories with respect to outcomes and then conduct an intervention development phase to examine the feasibility of using advanced AI outputs to tailor the design of an intervention that supports the integration of care needs. This phase will develop the program theory and scope intervention content, as well as identify and address implementation, trust, adoption, and scalability issues to support rapid incorporation into existing service pathways. The generated evidence could provide a powerful tool for delivering holistic care and reducing the human cost and resource burden of MLTC-M.
In this mixed methods program of work, we will use multiple large national primary care databases alongside qualitative work and a modified Delphi method to identify clusters of MLTC-M populations based on their health and social needs. Understanding these clusters and their trajectories over time will, in turn, help develop evidence-based solutions. These will be aimed at supporting the delivery of interventions tailored to address the needs pertinent to each homogenous cluster. Our evidence will provide key knowledge on how to generate clusters based on health and social needs and how to quantify the impact of clusters on long-term health and costs.
For our qualitative work, we will primarily be carrying out telephone or virtual interviews, whereas the modified Delphi method is exclusively internet-based. It is likely that respondents will include those able to use virtual technologies for interviews and those able to use and access virtual services to complete the web-based Delphi. People with MLTC-M who are unable to use these technologies, such as the elderly, those with disabilities, or those from lower socioeconomic backgrounds who may not have access to internet-based services, will not be sufficiently represented in our sample. It is plausible that richer qualitative data could be obtained through in-person interviews.
There are inherent limitations associated with analyses of secondary data. These data are collected in the clinical setting and not for research purposes. They will have variations in entries and coding that depend on individual clinicians. Incomplete, missing, and incorrectly coded records are likely to be limitations. The extent to which these problems exist in our data and their impact on our findings will require thorough exploration. Although our cohort will include primary care populations from several large national databases across large geographical areas in England and Wales, it may not include a sufficiently diverse sample of people of varying ethnicity and socioeconomic status, thus limiting generalizability.
Outputs from the research will offer commissioners and policy makers reliable evidence for a new approach to manage MLTC-M. Using a “whole person” approach could inform tailoring of interventions specific to each MLTC-M cluster. The evidence generated by this research has the potential to serve as a powerful tool for delivering holistic personalized care, thereby reducing the human cost and resource burden of MLTC-M.
A and E: accident and emergency
AI: artificial intelligence
ANN: artificial neural network
CPRD: Clinical Practice Research Datalink
ELSA: English Longitudinal Study of Ageing
GP: general practice
HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise
IMD: Index of Multiple Deprivation
LIME: local interpretable model-agnostic explanations
ML: machine learning
MLTC-M: multiple long-term health conditions (multimorbidity)
SAIL: Secure Anonymised Information Linkage
SELFIE: Sustainable intEgrated care models for multi-morbidity: delivery, FInancing and performancE
SHAP: SHapley Additive exPlanations
t-SNE: t-distributed stochastic neighbor embedding
UMAP: Uniform Manifold Approximation and Projection
XAI: explainable artificial intelligence
We would like to thank Ms Firoza Davies for her contribution as a patient and public representative. HDM is an Academic Clinical Lecturer funded by the National Institute for Health Research. AF receives support from NIHR Oxford Biomedical Research Centre. JS was supported by an MRC fellowship (grant MR/T027517/1). FZ has received funding from the National Institute for Health Research - Applied Research Collaboration East Midlands, NIHR Leicester Biomedical Research Centre. This paper reports independent research funded by the National Institute for Health Research (Artificial Intelligence for Multiple Long-Term Conditions [AIM]; grant NIHR202637). The views expressed in this publication are those of the author(s) and not necessarily those of the National Health Service, the National Institute for Health Research, or the Department of Health and Social Care.
None declared.