
Jon Duke
Georgia Institute of Technology · Computer Science
Active 1989–2026
About
Dr. Jon Duke is a Principal Research Scientist affiliated with the Georgia Tech Research Institute (GTRI) and the College of Computing at Georgia Tech. He focuses on large-scale observational research in healthcare, clinical natural language processing, phenotyping, and drug safety. He has led over $21 million in funded research for industry, government, and foundation partners. Dr. Duke's work aims to advance techniques for identifying patients of interest from diverse data sources, with applications spanning research, quality, and clinical domains. He led the Merck-Regenstrief Partnership in Healthcare Innovation and was a founding member of OHDSI, an open-source international health data analytics collaborative. He has published numerous peer-reviewed articles, and his work has been featured in lay media outlets such as the New York Times, NPR, and MSNBC. Dr. Duke completed his medical degree at Harvard Medical School and earned a master's degree in human-computer interaction at Indiana University.
Research topics
- Computer science
- Medicine
- Data science
- Pharmacology
- Internal medicine
Selected publications
The GenAI Generation: Student Views of Awareness, Preparedness, and Concern
2026-02-20 · 1 citation
Article · Open access
Generative Artificial Intelligence (Gen AI) is revolutionizing education and workforce development, profoundly shaping how students learn, engage, and prepare for their future. Outpacing the development of uniform policies and structures, Gen AI has heralded a unique era and given rise to the Gen AI Generation, a cohort of individuals whose development has been increasingly shaped by the opportunities and challenges Gen AI presents during its widespread adoption within society. This study examines higher education students' perceptions of Gen AI through a concise survey with optional open-ended questions, focusing on their awareness, preparedness, and concerns. “Readiness” appears increasingly tied to exposure to Gen AI through one's coursework. Students with greater curricular exposure to Gen AI tend to feel more prepared, while those without more often express vulnerability and uncertainty, highlighting a new and growing divide that goes beyond traditional disciplinary boundaries. Evaluation of more than 250 responses, with over 40 percent providing detailed qualitative feedback, reveals a core dual sentiment: while most students express enthusiasm for Gen AI, an even greater proportion voice a spectrum of concerns about ethics, job displacement, and the adequacy of educational structures given the highly transformative technology. These findings offer critical insights into how students view the potential and pitfalls of Gen AI for future career impacts.
JAMIA Open · 2024-12-26
Article · Open access · Senior author
Objective: The resurgence of syphilis in the United States presents a significant public health challenge. Much of the information needed for syphilis surveillance resides in electronic health records (EHRs). In this manuscript, we describe a surveillance platform for automating the extraction of EHR data, known as SmartChart Suite, and the results from a pilot. Materials and Methods: The SmartChart Suite framework has been developed in compliance with the HHS Health IT Alignment Policy. The platform's major functionalities are (1) data retrieval; (2) logical evaluation; (3) standardized data storage; and (4) results display. The SmartChart Suite was deployed in September 2023 at the Grady Health System in Atlanta, Georgia. We established a cohort of likely syphilis patients, randomly selected 50 medical records for manual and automated chart review, and analyzed the results. Results: The SmartChart Suite was successfully deployed and integrated with the Epic EHR system at Grady. The overall performance results were precision of 97.6%, recall of 100.0%, and F-Score of 98.8. Discussion: Automated abstraction of EHR data has significant potential to improve public health surveillance and case investigation processes while reducing the resource burden on health departments and reporters. The SmartChart Suite comprises a flexible open-source solution for registry development and maintenance across a wide spectrum of conditions and use cases. Conclusion: SmartChart Suite demonstrates the potential of automated chart abstraction to support disease surveillance. HHS-compliant open-source tools such as SmartChart Suite can support more efficient human review by providing accurate and relevant data for critical public health activities.
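As a quick consistency check on the figures reported in the abstract above, the F-score can be recomputed from the stated precision and recall. This is a minimal sketch, not the authors' evaluation code: the abstract does not say which F-measure was used, so the balanced F1 (beta = 1) is assumed, and the function and variable names are illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta = 1.0 (balanced F1) is an assumption; the abstract does not
    specify the weighting.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract, expressed as fractions.
reported_precision = 0.976
reported_recall = 1.000

# Convert back to a percentage, rounded to one decimal place.
f1_percent = round(100 * f_score(reported_precision, reported_recall), 1)
print(f1_percent)
```

Under the F1 assumption, the result rounds to the 98.8 stated in the abstract.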
Real-World Data Versus Probability Surveys for Estimating Health Conditions at the State Level
Journal of Survey Statistics and Methodology · 2024-09-28 · 2 citations
Article · Open access
Government statistical offices worldwide are under pressure to produce statistics rapidly and for more detailed geographies, to compete with unofficial estimates available from web-based big data sources or from private companies. Commonly suggested sources of improved health information are electronic health records (EHRs) and medical claims data. These data sources are collectively known as real world data (RWD) because they are generated from routine health care processes, and they are available for millions of patients. It is clear that RWD can provide estimates that are more timely and less expensive to produce, but a key question is whether they are accurate. To test this, we took advantage of a unique health data source that includes a full range of sociodemographic variables and compared estimates using all of those potential weighting variables against estimates derived when only age and sex are available for weighting (as is common with most RWD sources). We show that not accounting for other variables can produce misleading and quite inaccurate health estimates.
PubMed · 2023-01-01 · 1 citation
Article · Open access
…es and LOINC DO codes. Additionally, we developed a standardization pipeline that automatically maps clinical note titles from multiple sites to suitable LOINC DO codes, without accessing the content of clinical notes. The pipeline can be initialized with different large language models, and we compared their performance. The results showed that our automated pipeline achieved an accuracy of 0.90. By comparing the manual and automated mapping results, we analyzed the coverage of LOINC DO in describing multi-site clinical note titles and summarized the potential scope for extension.
Hypertension · 2021-03-29 · 42 citations
Article · Open access
Evidence for the effectiveness and safety of the third-generation β-blockers other than atenolol in hypertension remains scarce. We assessed the effectiveness and safety of β-blockers as first-line treatment for hypertension using 3 databases in the United States: 2 administrative claims databases and 1 electronic health record–based database from 2001 to 2018. In each database, comparative effectiveness of β-blockers for the risks of acute myocardial infarction, stroke, and hospitalization for heart failure was assessed, using large-scale propensity adjustment and empirical calibration. Estimates were combined across databases using random-effects meta-analyses. Overall, 118 133 and 267 891 patients initiated third-generation β-blockers (carvedilol and nebivolol) or atenolol, respectively. The pooled hazard ratios (HRs) of acute myocardial infarction, stroke, hospitalization for heart failure, and most metabolic complications were not different between the third-generation β-blockers versus atenolol after propensity score matching and empirical calibration (HR, 1.07 [95% CI, 0.74–1.55] for acute myocardial infarction; HR, 1.06 [95% CI, 0.87–1.31] for stroke; HR, 1.46 [95% CI, 0.99–2.24] for hospitalized heart failure). Third-generation β-blockers were associated with significantly higher risk of stroke than ACE (angiotensin-converting enzyme) inhibitors (HR, 1.29 [95% CI, 1.03–1.72]) and thiazide diuretics (HR, 1.56 [95% CI, 1.17–2.20]). In conclusion, this study found many patients with first-line β-blocker monotherapy for hypertension and no statistically significant differences in the effectiveness and safety comparing atenolol with third-generation β-blockers. Patients on third-generation β-blockers had a higher risk of stroke than those on ACE inhibitors and thiazide diuretics.
Journal of Medical Internet Research · 2021-06-17 · 14 citations
Article · Open access
BACKGROUND: Public health reporting is the cornerstone of public health practices that inform prevention and control strategies. There is a need to leverage advances made in the past to implement an architecture that facilitates the timely and complete public health reporting of relevant case-related information that has previously not easily been available to the public health community. Electronic laboratory reporting (ELR) is a reliable method for reporting cases to public health authorities but contains very limited data. In an earlier pilot study, we designed the Public Health Automated Case Event Reporting (PACER) platform, which leverages existing ELR infrastructure as the trigger for creating an electronic case report. PACER is a FHIR (Fast Healthcare Interoperability Resources)-based system that queries the electronic health record from where the laboratory test was requested to extract expanded additional information about a case. OBJECTIVE: This study aims to analyze the pilot implementation of a modified PACER system for electronic case reporting and describe how this FHIR-based, open-source, and interoperable system allows health systems to conduct public health reporting while maintaining the appropriate governance of the clinical data. METHODS: ELR to a simulated public health department was used as the trigger for a FHIR-based query. Predetermined queries were translated into Clinical Quality Language logics. Within the PACER environment, these Clinical Quality Language logical statements were managed and evaluated against the providers' FHIR servers. These predetermined logics were filtered, and only data relevant to that episode of the condition were extracted and sent to simulated public health agencies as an electronic case report. Design and testing were conducted at the Georgia Tech Research Institute, and the pilot was deployed at the Medical University of South Carolina.
We evaluated this architecture by examining the completeness of additional information in the electronic case report, such as patient demographics, medications, symptoms, and diagnoses. This additional information is crucial for understanding disease epidemiology, but existing electronic case reporting and ELR architectures do not report them. Therefore, we used the completeness of these data fields as the metrics for enriching electronic case reports. RESULTS: During the 8-week study period, we identified 117 positive test results for chlamydia. PACER successfully created an electronic case report for all 117 patients. PACER extracted demographics, medications, symptoms, and diagnoses from 99.1% (116/117), 72.6% (85/117), 70.9% (83/117), and 65% (76/117) of the cases, respectively. CONCLUSIONS: PACER deployed in conjunction with electronic laboratory reports can enhance public health case reporting with additional relevant data. The architecture is modular in design, thereby allowing it to be used for any reportable condition, including evolving outbreaks. PACER allows for the creation of an enhanced and more complete case report that contains relevant case information that helps us to better understand the epidemiology of a disease.
Hypertension · 2021-07-26 · 177 citations
Article · Open access
ACE (angiotensin-converting enzyme) inhibitors and angiotensin receptor blockers (ARBs) are equally guideline-recommended first-line treatments for hypertension, yet few head-to-head studies exist. We compared the real-world effectiveness and safety of ACE inhibitors versus ARBs in the first-line treatment of hypertension. We implemented a retrospective, new-user comparative cohort design to estimate hazard ratios using techniques to minimize residual confounding and bias, specifically large-scale propensity score adjustment, empirical calibration, and full transparency. We included all patients with hypertension initiating monotherapy with an ACE inhibitor or ARB between 1996 and 2018 across 8 databases from the United States, Germany, and South Korea. The primary outcomes were acute myocardial infarction, heart failure, stroke, and composite cardiovascular events. We also studied 51 secondary and safety outcomes including angioedema, cough, syncope, and electrolyte abnormalities. Across 8 databases, we identified 2 297 881 patients initiating treatment with ACE inhibitors and 673 938 patients with ARBs. We found no statistically significant difference in the primary outcomes of acute myocardial infarction (hazard ratio, 1.11 for ACE versus ARB [95% CI, 0.95–1.32]), heart failure (hazard ratio, 1.03 [0.87–1.24]), stroke (hazard ratio, 1.07 [0.91–1.27]), or composite cardiovascular events (hazard ratio, 1.06 [0.90–1.25]). Across secondary and safety outcomes, patients on ARBs had significantly lower risk of angioedema, cough, pancreatitis, and GI bleeding. In our large-scale, observational network study, ARBs do not differ statistically significantly in effectiveness at the class level compared with ACE inhibitors as first-line treatment for hypertension but present a better safety profile. These findings support preferentially prescribing ARBs over ACE inhibitors when initiating treatment for hypertension.
EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders
arXiv (Cornell University) · 2020-12-17 · 7 citations
Preprint · Open access
Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and its volume is also limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and can be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate the generated EHR sequences are realistic. We confirmed the performance of predictive models trained on the synthetic data are similar with those trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance - improving the best baseline by as much as 8% in top-20 recall.
BMJ Health & Care Informatics · 2020-03-01 · 26 citations
Article · Open access
INTRODUCTION: As the health system seeks to leverage large-scale data to inform population outcomes, the informatics community is developing tools for analysing these data. To support data quality assessment within such a tool, we extended the open-source software Observational Health Data Sciences and Informatics (OHDSI) to incorporate new functions useful for population health. METHODS: We developed and tested methods to measure the completeness, timeliness and entropy of information. The new data quality methods were applied to over 100 million clinical messages received from emergency department information systems for use in public health syndromic surveillance systems. DISCUSSION: While completeness and entropy methods were implemented by the OHDSI community, timeliness was not adopted as its context did not fit with the existing OHDSI domains. The case report examines the process and reasons for acceptance and rejection of ideas proposed to an open-source community like OHDSI.
EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders
arXiv (Cornell University) · 2020-12-18 · 9 citations
Preprint · Open access
Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and its volume is also limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and can be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate the generated EHR sequences are realistic. We confirmed the performance of predictive models trained on the synthetic data are similar with those trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance - improving the best baseline by as much as 8% in top-20 recall.
Frequent coauthors
- Brian E. Dixon · Indiana University – Purdue University Indianapolis · 41 shared
- Shaun J. Grannis · Indiana University School of Medicine · 37 shared
- Paul Dexter · Indiana University – Purdue University Indianapolis · 37 shared
- Kathleen Toomey · Somerville Hospital · 36 shared
- Andrew G. Dean · 36 shared
- G. Allen Tindol · Memorial Health University Medical Center · 36 shared
- Stephen R. Pitts · 36 shared
- Burke W. Mamlin · Regenstrief Institute · 30 shared