The Diabetes Location, Environmental Attributes, and Disparities Network: Protocol for Nested Case Control and Cohort Studies, Rationale, and Baseline Characteristics

Background: Diabetes prevalence and incidence vary by neighborhood socioeconomic environment (NSEE) and geographic region in the United States. Identifying modifiable community factors driving type 2 diabetes disparities is essential to inform policy interventions that reduce the risk of type 2 diabetes.
Objective: This paper aims to describe the Diabetes Location, Environmental Attributes, and Disparities (LEAD) Network, a group funded by the Centers for Disease Control and Prevention to apply harmonized epidemiologic approaches across unique and geographically expansive data to identify community factors that contribute to type 2 diabetes risk.
Methods: The Diabetes LEAD Network is a collaboration of 3 study sites and a data coordinating center (Drexel University). The Geisinger and Johns Hopkins University study population includes 578,485 individuals receiving primary care at Geisinger, a health system serving a population representative of 37 counties in Pennsylvania. The New York University School of Medicine study population is a baseline cohort of 6,082,146 veterans who do not have diabetes and are receiving primary care through Veterans Affairs from every US county. The University of Alabama at Birmingham study population includes 11,199 participants who did not have diabetes at baseline from the Reasons for Geographic and Racial Differences in Stroke (REGARDS) study, a cohort study with oversampling of participants from the Stroke Belt region.
Results: The Network has established a shared set of aims: evaluate mediation of the association of the NSEE with type 2 diabetes onset, evaluate effect modification of the association of NSEE with type 2 diabetes onset, assess the differential item functioning of community measures by geographic region and community type, and evaluate the impact of the spatial scale used to measure community factors. The Network has developed standardized approaches for measurement.
Conclusions: The Network will provide insight into the community factors driving geographical disparities in type 2 diabetes risk and disseminate findings to stakeholders, providing guidance on policies to ameliorate geographic disparities in type 2 diabetes in the United States.
International Registered Report Identifier (IRRID): DERR1-10.2196/21377


Introduction
Background
An estimated 10.5% of the US population has diabetes, and these 34 million individuals are at an increased risk for coronary artery disease, cerebrovascular disease, and other complications [1,2]. Approximately 90% to 95% of people with diabetes have type 2 diabetes (T2D) [1]. Another 88 million individuals have prediabetes, defined as glucose levels above normal but below the threshold for diabetes, and are at elevated risk of developing T2D [3,4]. Diabetes prevalence and incidence vary substantially by geographic region [5-7]. In 2013, there was a six-fold difference between counties with the lowest and highest diabetes prevalence [6]. A large body of literature links community and environmental factors (hereafter referred to as community factors) to T2D and obesity, one of the risk factors for T2D [8-15]; however, the mechanisms for these links remain poorly understood. Moreover, there are inconsistencies in this body of literature. Identifying the community factors driving T2D disparities and the pathways through which these factors influence T2D is essential to informing geographically targeted policy interventions that reduce the risk of T2D and related outcomes.
Among the challenges to creating a cohesive body of research is a lack of consistent approaches to conceptualizing and operationalizing the geographic area in which community factors are thought to be relevant to health [19-22]. Furthermore, the size and boundaries of spatial scales most relevant to health may vary according to community type (eg, across the gradient from urban to rural). Community type is also an important consideration in measurement development, as measurement of the same community factors may require different approaches [23]. For example, car ownership may be a basic necessity for individuals living in rural areas but more of a luxury for individuals living in urban areas with good public transportation options. Thus, car ownership may work differently as an indicator of the neighborhood socioeconomic environment (NSEE) in urban versus rural areas [24]. This differential item functioning may contribute to inconsistencies observed in the literature as the same measure (eg, proportion who own a car) could hold different meanings in different community types.
Community and health research is also vulnerable to structural confounding, which occurs when individual and contextual factors strongly predict residence in a certain community [25,26]. In the presence of structural confounding, certain measures, owing to social sorting, are largely nonoverlapping, resulting in an inability to examine their independent influences on the outcome of interest. For example, in some settings, the distribution of persons across categories of the NSEE and racial residential segregation may reveal a lack of comparable groups across key strata, resulting in analytical challenges of nonpositivity that prohibit causal contrasts across levels of exposure [26]. Finally, capturing the complex interactions among multiple community factors on health can be challenging [27]. As a result, previous research has largely evaluated co-occurring community factors in isolation.

Objectives
The increasing availability of longitudinal, individual-level data from electronic health record (EHR) networks [28] and cohort studies, coupled with advances in geographic information systems (GISs), provides new opportunities to examine the effects of community factors on health. In 2017, the Diabetes LEAD (Location, Environmental Attributes, and Disparities) Network was established to identify the contributions of modifiable community factors to T2D risk. The Network includes researchers from 4 academic institutions who collaborate to address the methodological challenges previously described and to investigate a range of community factors across the United States. The Network aims to guide policy decision making to reduce the burden of T2D across the United States. This paper aims to describe the Diabetes LEAD Network, its study populations, and the methodologies used to investigate the community factors that are associated with T2D onset and related outcomes.

Network Overview
The Diabetes LEAD Network is a research collaboration of 4 academic centers: Drexel University, Geisinger and Johns Hopkins University (G/JHU), New York University School of Medicine (NYU), and the University of Alabama at Birmingham (UAB). The Centers for Disease Control and Prevention (CDC) funded the Network to bring together institutions with diverse but complementary expertise and a rich array of data assets. Three study sites (G/JHU, NYU, and UAB) use longitudinal data, such as EHRs, administrative claims, and survey data, on distinct populations and geographies in the United States (Tables 1-3; Figures 1-3). Drexel, the data coordinating center (DCC), is leading the development of a set of harmonized community factors, health outcomes, and analysis plans (Tables 4 and 5) that will be applied to each study site's cohort and geography.
Each site has its own set of study aims that examine community factors and T2D outcomes, including T2D onset, obesity, and other cardiometabolic conditions. Working collaboratively, the study sites and the CDC also developed a shared set of aims that complement site-specific aims (Textbox 1). We first describe the shared Network aims and then describe the site-specific aims.

Table excerpt (column headings: Domain; Data source and years; Spatial scale; Description)
Description: Area-level index derived from a z-score sum of indicators of the community's social and economic characteristics [29]: percentage of males and females with less than a high school education, percentage of males and females unemployed, percentage of households earning less than US $30,000 per year, percentage of households in poverty, percentage of households on public assistance, and percentage of households with no cars

Geisinger and Johns Hopkins University:
• To evaluate associations of chronic environmental contamination [30] (eg, abandoned coal mine lands); the food environment; the physical activity environment; the land use environment; the natural environment (eg, greenness); community type (eg, urban/rural); and community socioeconomic deprivation (CSD) with type 2 diabetes (T2D) onset and control and coronary heart disease (CHD) onset within communities.
• To evaluate mediating pathways (eg, food, physical activity environment) between the neighborhood socioeconomic environment and T2D onset (through LEAD Network Aim 1).
• To evaluate mediating pathways (eg, stress, health behaviors) between community factors and T2D control among 1000 individuals with T2D living in 40 communities.
• To evaluate potential effect modification by key individual (eg, age, Medical Assistance) and community factors (eg, CSD) of relations between community factors and T2D and CHD within communities.
New York University School of Medicine:
• Using public-use data sources, determine the independent and joint associations between novel community measures and county-level prevalence of outcomes (diabetes, obesity, and diabetes-obesity prevalence discordance profile), controlling for other county measures (eg, population density, socioeconomic status, and demographic distributions).
• Measure the impact of modifiable community characteristics such as food and housing environments on (a) risk of a new T2D diagnosis or (b) being obese (BMI ≥30 kg/m²) in a large cohort of Veterans Affairs patients, adjusting for community and individual-level covariates in multilevel regression models.
• Use mediation analysis to examine mediating pathways between modifiable community contexts and T2D.
University of Alabama at Birmingham:
• To determine the association of community-level social determinants of health with the prevalence and incidence of T2D and hypertension, separately.
• To determine if pharmacologic treatment patterns and hospitalization rates vary by community-level social determinants of health for those with T2D and hypertension, separately.
• To determine if awareness and treatment of T2D and risk of cardiovascular complications vary by community-level and individual-level social determinants of health.

Network Aims
The Network aims to evaluate the association of community factors and T2D outcomes (aims 1 and 2) and to evaluate and address the previously described methodological challenges of community and health research (aims 3 and 4):
1. Evaluate the mediation of the association of NSEE with new-onset T2D. This aim reflects a conceptual framework (Figure 4) proposing that NSEE influences T2D onset through other community pathways, including the food and physical activity (fitness and leisure) environments and exposure to fine particulate matter (≤2.5 µm; PM2.5).
2. Define and test effect modifiers (eg, age, sex, race) of the association of NSEE with new-onset T2D.
3. Assess the differential item functioning of community measures by geographic region and community type.
4. Evaluate the impact of the spatial scale used to measure community factors (eg, buffer, census tract, county) on associations with new-onset T2D.

Network Populations and Geographic Coverage
The Diabetes LEAD Network draws from individuals living in all 50 US states (Figures 1-3). The G/JHU participants were selected from among 1.6 million individuals in the Geisinger EHR, spanning 37 counties in central and northeastern Pennsylvania (  [34]. To assess T2D onset, participants without T2D at baseline and who completed the follow-up in-home examination (n=11,199) will be evaluated. Patients were not invited to comment on the cohort development or study design.

Network Data Sources and Measurement
The DCC is leading the development of harmonized, Network-wide approaches to measuring community factors of interest and T2D outcomes. To develop measures of community factors (Table 4), the DCC is using archival data available at the national level, including publicly available data (eg, US Census) and data elements previously created for the Retail Environment and Cardiovascular Disease (RECVD) study. The RECVD study has longitudinal measures of food, fitness, and social establishments based on the National Establishment Time Series (NETS), a data source that includes information on more than 58 million US business establishments from 1990 to 2014. For each community factor, the DCC has partnered with a study site with relevant expertise to make decisions regarding data sources, spatial scale, exposure assignment, and approach to measurement.
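As an illustration of how one such measure could be computed, the sketch below builds an area-level index as a sum of z-scores, following the description given earlier for the socioeconomic index (six percentage indicators summed after standardization). The indicator names, the use of the population standard deviation, and the example values are assumptions made for illustration; they are not the Network's actual variable names or data.

```python
from statistics import mean, pstdev

# Hypothetical indicator names; each value is a percentage for one area
# (eg, a census tract).
INDICATORS = [
    "pct_less_than_high_school",
    "pct_unemployed",
    "pct_households_under_30k",
    "pct_households_in_poverty",
    "pct_households_public_assistance",
    "pct_households_no_car",
]


def zscore_sum_index(areas):
    """Return {area_id: sum of z-scores across INDICATORS}.

    `areas` maps an area id to a dict of indicator values. Each indicator
    is standardized across all areas, then the z-scores are summed, so
    higher values indicate greater socioeconomic disadvantage.
    """
    index = {area_id: 0.0 for area_id in areas}
    for name in INDICATORS:
        values = [areas[area_id][name] for area_id in areas]
        mu, sigma = mean(values), pstdev(values)
        for area_id in areas:
            # An indicator with no variation contributes 0 to every area.
            z = 0.0 if sigma == 0 else (areas[area_id][name] - mu) / sigma
            index[area_id] += z
    return index


# Made-up example: three areas with uniformly low, middle, and high values.
example = {
    "tract_A": dict.fromkeys(INDICATORS, 10.0),
    "tract_B": dict.fromkeys(INDICATORS, 20.0),
    "tract_C": dict.fromkeys(INDICATORS, 30.0),
}
scores = zscore_sum_index(example)
```

Because each indicator is standardized before summing, all six contribute on the same scale. If measurement is stratified by community type, the same function could be applied separately within each stratum.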
The DCC is applying a range of measurement techniques to define community factors, including data reduction and measurement models. Measurement development is stratified by community type at the census tract level, using a Network-developed modification of the Rural-Urban Commuting Area (RUCA) codes from the US Department of Agriculture [35]. After collapsing the original 10 RUCA categories into 3, the DCC further divided census tracts within urbanized areas into 2 categories based on land area, resulting in 4 community-type categories that reflect distinct typologies along the rural-urban continuum.
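The two-step classification described above can be sketched as a small lookup followed by a land-area split. The text does not state which RUCA codes fall into each of the 3 collapsed groups or what land-area cutoff is used, so the grouping, the cutoff, and the category labels below are placeholders chosen only to make the example run; the published values are in the cited reference [35].

```python
# Placeholder grouping of the 10 primary RUCA codes into 3 base categories.
# This mapping is an assumption for illustration, not the Network's definition.
RUCA_TO_BASE = {1: "urban"}
RUCA_TO_BASE.update({code: "suburban/small town" for code in range(2, 7)})
RUCA_TO_BASE.update({code: "rural" for code in range(7, 11)})

# Hypothetical land-area cutoff (square miles) used to split urban tracts.
URBAN_LAND_AREA_CUTOFF_SQ_MI = 2.5


def community_type(ruca_primary_code, land_area_sq_mi):
    """Assign one of 4 community types to a census tract."""
    base = RUCA_TO_BASE[ruca_primary_code]
    if base != "urban":
        return base
    # Only tracts in urbanized areas are split further, by land area.
    if land_area_sq_mi < URBAN_LAND_AREA_CUTOFF_SQ_MI:
        return "higher density urban"
    return "lower density urban"


types = [
    community_type(1, 0.4),
    community_type(1, 12.0),
    community_type(4, 30.0),
    community_type(10, 250.0),
]
```

Keeping the grouping and the cutoff as named constants makes it straightforward to swap in the published definitions without touching the classification logic.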
To the extent possible, the Network is harmonizing approaches to measure T2D onset (Table 5) and diabetes-related outcomes. G/JHU and NYU have worked together to develop EHR-based algorithms based on their previous work [36,37] and diagnosis criteria from the American Diabetes Association [38], using a combination of diagnosis codes, medications, and laboratory measures. With coordination from the DCC, the sites are also standardizing approaches to measure potential confounders, mediators, and effect modifiers.

Network Analyses
For each Network-wide aim, the study sites will conduct analyses among their study populations based on a common analytic plan. The DCC is coordinating the development of the analytic plan, harmonizing analytical approaches, including the selection of confounding, mediating, and modifying variables of interest, model building, and model diagnostics. Sites will conduct site-specific sensitivity analyses that include relevant data elements that may not be available Network-wide. This approach allows us to examine consistency in results while leveraging the unique data available at individual sites.
For aims 1 and 2, the Network will employ methods that account for both group-level and individual-level data, including multilevel models, Bayesian approaches, and generalized estimating equation models. The Network will conduct causal mediation analysis for aim 1 [39]. For aim 2, we will evaluate effect modification by testing interaction terms, formed as cross-products between our contextual domains of interest and a predetermined set of individual- and community-level variables, such as age, sex, and race. To guide model development for these aims, sites are developing causal diagrams to formulate and test theoretically based pathways, identify potential confounding influences, and account for potential interaction between measures (Figure 4) [40,41]. To assess residual spatial autocorrelation, the Network will calculate global and local Moran's I statistics [42] and, if needed, use modeling approaches that account for it. The Network will conduct sensitivity analyses to evaluate how approaches to measurement of outcomes and community factors impact observed associations.
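To make the spatial autocorrelation check concrete, the following is a minimal sketch of the global Moran's I statistic, computed here on made-up values standing in for model residuals. The function name, the binary adjacency weights, and the 4-area example are assumptions for illustration; in practice an established spatial statistics package would likely be used, and the local statistic is not shown.

```python
def global_morans_i(values, weights):
    """Global Moran's I for `values` under a square spatial weights matrix.

    weights[i][j] > 0 means areas i and j are neighbors; the diagonal is 0.
    Values near 0 suggest no spatial autocorrelation, positive values
    suggest clustering of similar values, and negative values suggest
    that neighboring areas tend to differ.
    """
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    denom = sum(d * d for d in dev)
    return (n / total_weight) * (cross / denom)


# Four hypothetical areas arranged in a line; each neighbors the next one.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_trend = global_morans_i([1.0, 2.0, 3.0, 4.0], chain_weights)
i_alternating = global_morans_i([1.0, -1.0, 1.0, -1.0], chain_weights)
```

In this toy layout, the smoothly increasing series gives a positive statistic and the alternating series a negative one, which is the pattern one would look for in residuals before deciding whether a spatial model is needed.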
For aim 3, the Network is exploring strategies to evaluate and address nonpositivity, including propensity scores [43], stratification by community type for all analyses, and latent profile analysis to evaluate community typology [44]. For aim 4, we are evaluating community factors at multiple spatial scales (eg, census tract, network buffer around population-centroid). We will compare the associations of community factors measured at different scales with T2D to assess the influence of the modifiable areal unit problem [21]. The study sites also have site-specific aims, as described below.

Geisinger and Johns Hopkins University
The Environmental Health Institute, a joint collaboration between Geisinger, Johns Hopkins Bloomberg School of Public Health, and Johns Hopkins School of Medicine, is evaluating the influence of community factors on T2D onset and control and cardiometabolic outcomes in Pennsylvania (Textbox 1) using a combination of primary and secondary data collection.
The team is conducting the study among patients from Geisinger, a health system serving 1.6 million patients in central and northeastern Pennsylvania. To be eligible for study, individuals had to reside in one of 37 counties in Geisinger's service area and have at least two Geisinger primary care visits from 2006 to 2016 (Table 1). The Geisinger primary care patient population represents the age, sex, and racial and ethnic distribution of the general population of the region [32]. The region's population is residentially stable, with an annual out-migration rate of approximately 1% in all but two counties according to US Census Bureau data.
G/JHU is evaluating the main effects of 8 community factors: NSEE, food environment, fitness environment, leisure-time physical activity environment, land use environment, greenness, blue space (aquatic environments such as coasts, lakes, and rivers), and chronic environmental contamination [45]. G/JHU has previously reported associations between these factors and obesity and glycated hemoglobin (HbA1c) [46-51]. G/JHU is using data from publicly available sources (eg, US Census, American Community Survey [ACS], Moderate Resolution Imaging Spectroradiometer from the National Aeronautics and Space Administration's Terra satellite, Pennsylvania Department of Transportation, TeleAtlas) and commercial data to generate measures for these factors. For the Network aims, the team is working with the DCC to guide decisions on the land use and physical fitness environment measures.
For site-specific aims, G/JHU outcomes include T2D onset, T2D control, and cardiometabolic outcomes (Textbox 1). G/JHU is conducting 2 types of studies to evaluate the associations between community factors and T2D outcomes and mediation and moderation of these associations. EHR-based analyses will be used for both Network- and site-specific aims. A primary data collection study will be used for additional site-specific aims, as much of the data collected in this study will be uniquely available at the G/JHU site (Multimedia Appendix 1).
G/JHU is using a mix of nested case-control and retrospective cohort study designs to achieve site-specific aims, using logistic and linear regression as appropriate. To account for correlation due to both place and space, G/JHU is using generalized estimating equations and multilevel modeling. To examine mediators of the association between NSEE and T2D onset (Network Aim 1), Geisinger will apply a nested case-control design and formal mediation models that include T2D onset cases (n=15,888) matched to controls (n=79,435) on age, sex, and year of encounter.
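The matching step in a nested case-control design can be sketched as follows. This is a simplified illustration that assumes exact matching on age, sex, and encounter year, random selection without replacement, and up to 5 controls per case (the reported counts imply a ratio of about 5 to 1). The field names and the sampling scheme are assumptions; the site's actual procedure (eg, age bands or risk-set sampling) is not described in the text.

```python
import random
from collections import defaultdict


def match_controls(cases, candidates, k=5, seed=0):
    """Match each case to up to k controls sharing age, sex, and year.

    `cases` and `candidates` are lists of dicts with the hypothetical keys
    "id", "age", "sex", and "year". Controls are drawn at random without
    replacement, so no control is matched to more than one case.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for person in candidates:
        pool[(person["age"], person["sex"], person["year"])].append(person["id"])
    for ids in pool.values():
        rng.shuffle(ids)

    matched = {}
    for case in cases:
        stratum = pool[(case["age"], case["sex"], case["year"])]
        # Take up to k controls and remove them from the pool.
        matched[case["id"]] = [stratum.pop() for _ in range(min(k, len(stratum)))]
    return matched


# Made-up data: 7 eligible controls for the first case, 2 for the second.
cases = [
    {"id": "case1", "age": 60, "sex": "F", "year": 2010},
    {"id": "case2", "age": 45, "sex": "M", "year": 2012},
]
candidates = (
    [{"id": f"f{n}", "age": 60, "sex": "F", "year": 2010} for n in range(7)]
    + [{"id": f"m{n}", "age": 45, "sex": "M", "year": 2012} for n in range(2)]
    + [{"id": "other", "age": 30, "sex": "F", "year": 2015}]
)
matches = match_controls(cases, candidates)
```

A case whose stratum holds fewer than k eligible controls simply receives fewer matches, which is one way a total control count can fall slightly below an exact multiple of the number of cases.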

New York University School of Medicine
Investigators at NYU are examining the relationship between modifiable community factors and risk for T2D and obesity using a retrospective cohort assembled through EHR data from the Veterans Affairs Corporate Data Warehouse, a national repository of clinical and administrative data. The 2 primary exposures of interest are the food and housing environments. The assembled cohort includes more than 6 million veteran patients who were diabetes-free upon entry into the cohort from 2008 to 2016. Entry eligibility includes 2 primary care visits with no indication of diabetes within the 5 years before cohort entry, with at least two follow-up visits at least 30 days apart during the study period (2008-2018). The population has a well-documented high incidence of diabetes [36], providing adequate variation in contexts and outcomes to examine community factors in relation to T2D incidence.
For site-specific analyses, NYU's primary community factors of interest are the food and housing environments. Food environment metrics include 2 absolute measures and 2 relative measures created from the RECVD data (Table 4). The NYU team also has store-level Nielsen Retail Scanner data from 2006 to 2014, which will be used to examine potential mechanistic pathways, including whether risks associated with living in select food environments are partially mediated through per capita sales of sugar-sweetened beverages. The NYU team is guiding Network decisions on the food environment measure development and harmonization, in collaboration with the DCC. They are also engaging in site-specific analyses to examine the influence of housing affordability per ACS and Veterans Affairs data on T2D risk.
NYU study outcomes include diabetes incidence and control as well as obesity prevalence and incidence. Outcome data are extracted from EHRs, capturing demographic, clinical, and utilization data. To ensure participants in the cohort do not have diabetes at cohort entry, individuals with any diabetes (type 1 or type 2) International Classification of Diseases, Ninth or Tenth Revision (ICD-9/10) code or elevated HbA1c at enrollment are excluded. Time-to-event analyses (Cox proportional hazards models with frailty to account for clustering within a community) will be used to examine the main effects of the food environment on T2D risk and its role in mediating the association between NSEE and T2D risk. Person-time is calculated as the date of the first event or censoring (diabetes diagnosis, death, loss to follow-up, or end of study period) minus the date of cohort entry. The date of death is obtained from the Veterans Affairs Vital Status and Beneficiary Identification Records Locator. Loss to follow-up is defined as no Veterans Affairs encounter for more than 2 years, but patients can re-enter the cohort if they meet entry criteria again.
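The person-time rule above amounts to taking the earliest of the possible exit dates and subtracting the entry date. The sketch below shows that arithmetic under simplifying assumptions: the loss-to-follow-up date is supplied directly rather than derived from encounter gaps, re-entry into the cohort is not modeled, and the function and parameter names are hypothetical.

```python
from datetime import date


def person_time_days(entry, study_end, diagnosis=None, death=None, lost=None):
    """Return (days of follow-up, whether follow-up ended in a diagnosis).

    Follow-up runs from cohort entry to the earliest of diabetes diagnosis,
    death, loss to follow-up, or the end of the study period. Dates that do
    not apply to a patient are passed as None.
    """
    exit_dates = [d for d in (diagnosis, death, lost, study_end) if d is not None]
    exit_date = min(exit_dates)
    if exit_date < entry:
        raise ValueError("exit date precedes cohort entry")
    had_event = diagnosis is not None and exit_date == diagnosis
    return (exit_date - entry).days, had_event


# Made-up patients entering on the same day.
entry_date = date(2008, 1, 1)
end_of_study = date(2018, 12, 31)
diagnosed = person_time_days(entry_date, end_of_study, diagnosis=date(2010, 1, 1))
censored = person_time_days(entry_date, end_of_study, lost=date(2009, 1, 1))
```

The returned pair corresponds to the duration and event indicator that a time-to-event model such as a Cox model takes as input.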

University of Alabama at Birmingham
The UAB site is investigating the association of NSEE with a greater burden of T2D and cardiovascular risk, particularly in the southeastern United States. To address site-specific and Network-wide research questions, UAB is leveraging resources from the REGARDS study [33]. The REGARDS study is a longitudinal, population-based closed cohort study of 30,239 adults aged 45 years and older at baseline (2003-2007), designed to identify factors associated with higher stroke mortality. The study was designed to oversample non-Hispanic Black adults and residents of the Stroke Belt region, with 56% of the sample selected from the Stroke Belt and the remaining 44% selected from the other 40 contiguous states. Demographics, medical history, and lifestyle factors were assessed at baseline, and an in-home physical exam was performed with blood and urine collection. Follow-up is ongoing every 6 months to assess vital status and hospitalizations and obtain medical records for adjudication of possible cardiovascular events. A second in-home physical exam was completed between 2013 and 2016.
For site-specific analyses, the primary exposure is the NSEE, assessed using principal component analysis of measures of community-level income or wealth, education, housing, health systems or services, employment, social environment, and physical environment. The data to assess these characteristics include both publicly available databases (eg, US Census) and commercial databases (eg, Dun & Bradstreet). The primary outcomes are incident T2D and cardiovascular outcomes. Incident T2D will be assessed among 11,199 REGARDS study participants without prevalent T2D at baseline and who completed the follow-up in-home physical exam during which objective measurements (eg, glucose, use of medications) were collected (Table 2). Cardiovascular outcomes include hypertension (ie, mean blood pressure ≥140/90 mm Hg or use of hypertension medications) and expert adjudicated clinical events (ie, coronary heart disease, stroke).
Separate from the analysis of the REGARDS study data, UAB will utilize Medicare administrative claims data to investigate the association of NSEE with T2D and hypertension incidence. Medicare comprises several federal health insurance programs that cover adults aged 65 years and older and younger individuals who are disabled or have end-stage renal disease. Broadly, Medicare Part A covers hospital services, Medicare Part B covers outpatient and physician services, and Medicare Part D covers prescription drugs. UAB will use the 5% random sample of Medicare claims data available from 1999 to 2015 to investigate community-level determinants of T2D incidence, diabetes hospitalizations, and treatment patterns. An overview of the Medicare sample population and diabetes definitions used for site-specific analyses is provided in Multimedia Appendix 2. Statistical approaches include generalized linear models and spatial generalized linear mixed models.

Drexel Data Coordinating Center
The Drexel DCC provides the study sites with project coordination and statistical support, including advanced methodological and analytic approaches to data analyses driven by the Network aims and heterogeneous data from each study site. The expertise needed for this work is reflected in the backgrounds of DCC team members, including biostatisticians and epidemiologists from the Dornsife School of Public Health and the Drexel Urban Health Collaborative (UHC), postdoctoral fellows, doctoral-level biostatistics students, data analysts and managers, and GIS experts. Through exploratory analytic work, including principal component analysis, exploratory factor analysis, GIS analysis and mapping, and correlation analysis of contextual indices against individual variables, the DCC supports the Network's collective decision making around defining exposure metrics for addressing Network aims. The relationship with the UHC also allows for access to data from RECVD and other sources; provides support for GIS methods, data distribution, and storage; and provides access to data engineering experts. Furthermore, the UHC has a Policy and Outreach Core, which helps provide guidance on disseminating LEAD Network findings.
JMIR Res Protoc 2020 | vol. 9 | iss. 10 | e21377 | http://www.researchprotocols.org/2020/10/e21377/

Results
The Network has developed metrics for the community factors of interest: NSEE, food establishment, physical fitness establishment, leisure-time physical activity, and land use environments (Table 4). The Network has created these measures using data that are consistently available and contextually applicable to all geographies in the contiguous United States. This underscores the importance of the Network's development of a method for categorizing community types for stratified evaluation of community factors with T2D onset. With harmonized measures, the Network is poised to compare findings across the varying study sites.
The Network has reported findings based on work from the initial years of funding. Preliminary results have been presented at annual meetings of the Society for Epidemiologic Research, the American Diabetes Association, and the American Public Health Association [52,53]. The Network recently published a paper describing county-level determinants of diabetes status in the United States from 2003 to 2012 [54]. The NYU team published a paper describing the impact of changes in the built and social environment on BMI in US counties using data from the Behavioral Risk Factor Surveillance System [13]. Additional manuscripts are in press or in development.

Strengths and Limitations
The Diabetes LEAD Network leverages a breadth of expertise and data to advance knowledge regarding modifiable community risk factors for T2D onset and related outcomes. The Network brings strengths to its collective mission to provide scientific evidence for targeted interventions and policies. First, the sites provide the Network with community data sources that collectively ensure widespread geographic coverage and variation of community types across a rural-to-urban spectrum. It was of particular importance to ensure representation by rural communities, since CDC reports that diabetes is 17% more prevalent in rural than urban areas [55]. Furthermore, understanding the association between community and health requires the assessment of a heterogeneous set of communities [12,56]. Each of the study sites contributes unique data sources to achieve this goal. The NYU cohort spans the nation, the UAB REGARDS study cohort offers a national study with in-depth data from a high-risk region (Stroke Belt), and the G/JHU population is a regionally representative sample with high rural representation and primary data collection.
A second contribution of the Network is the development of measures of 6 community factors (Table 4) to be examined across diverse geographies and community types. These measures are being developed with consideration of community types, examining community factors within strata of different community types defined along a rural-urban spectrum to avoid potential differential item functioning and nonpositivity [23]. In addition, Network investigators are examining individual and joint associations to better understand how these community factors work in concert to contribute to excess T2D risk. With access to individuals' residential and commercial addresses, the Network is evaluating spatial scales of various types and sizes to better understand the impact of scale on the findings.
Third, the range of expertise across institutions allows the Network to address methodological challenges common to community and health research [20]. In addition, the Network is advancing methods for conducting research on the role of community and health using data obtained from EHR systems, including extracting historical residential addresses to allow for time-varying exposure estimation. Finally, the Network is harmonizing community factor definitions and analytical approaches to facilitate comparable analyses and replicating analyses across 3 different study populations. This effort to harmonize approaches across multiple settings and populations will both advance the field of community and health research and generate data needed to guide evidence-based policies for T2D prevention in the United States.
There are some limitations to the Diabetes LEAD Network research portfolio. First, although access to longitudinal data will mitigate issues of temporality, the available data do not allow for investigation of early life exposures (eg, childhood) that may influence T2D risk [57]. Second, while the diversity of populations and geographies across the study sites is advantageous to expand the generalizability of findings and include previously underrepresented settings (ie, rural), it complicates comparison of results across sites. However, by harmonizing measurement and analytical approaches the Network will be well positioned to pinpoint reasons for any potential conflicting results that arise. Finally, despite employing advanced analytic approaches, the studies are all observational in design; thus, they are potentially constrained with respect to causal inference due to the risk of residual confounding and neighborhood self-selection [58,59]. The Network will consider methodological approaches such as propensity scores to address this limitation [43].

Conclusions
T2D is a leading cause of morbidity in the United States, with select populations, often defined by geography, affected by a disproportionate burden of disease. The Diabetes LEAD Network identifies modifiable community factors that influence geographic disparities in T2D risk across diverse communities and identifies policy levers to ameliorate these disparities.