Christina Boucher
Ph.D. · Professor · University of Florida · Computer & Information Science & Engineering
Active 1984–2026
About
Christina Boucher is a Professor in the Department of Computer & Information Science & Engineering at the University of Florida. Her research focuses on algorithms and data structures for bioinformatics, including compressed indexing of large pangenomic data sets (prefix-free parsing and the r-index), short-read alignment and taxonomic classification, and computational methods for surveillance of antimicrobial resistance.
Research topics
- Computer Science
- Data Mining
- Biology
- Artificial Intelligence
- Information Retrieval
- Statistics
- Algorithm
- Mathematics
- Theoretical computer science
- Database
- Computational biology
- Genetics
- Combinatorics
Selected publications
Nature Communications · 2026-04-18
Article · Open access
Chromosome 22q11.2 microdeletion syndrome (22q11.2DS) is mediated by high-identity polymorphic low-copy repeats (LCRA-to-D) that have been challenging to sequence and characterize. We sequence-resolved 135 chromosome 22q11.2 haplotypes from diverse humans and define 63 distinct structural configurations differing in size by 11-fold for LCRA. This diversity is driven by a 105 kbp segmental duplication flanked by 25 kbp inverted repeats that arose in the apes but expanded in humans ~1 million years ago. African LCRA haplotypes are significantly longer (p = 0.0047) and predicted to be more protective against 22q11.2DS (p = 1.14×10⁻⁶) due to enrichment of inverted 105 kbp repeats. We identify nine distinct (including five recurrent) inversions spanning LCRA-D. Sequencing four families indicates LCRA-D deletions map to 105 kbp repeats, whereas inversions map to the 25 kbp repeats. Here, we show specific haplotype LCR architectures and recurrent large-scale inversions modulate susceptibility to 22q11.2DS and help explain its reduced prevalence among individuals of African ancestry.
Rapid-PFP: Accelerating Prefix-Free Parsing with GPU Parallelism
bioRxiv (Cold Spring Harbor Laboratory) · 2026-05-01
Article · Senior author
Prefix-Free Parsing (PFP) is widely used in genomic data processing to construct compressed indexes on massive, highly repetitive datasets. However, existing CPU implementations are constrained by sequential bottlenecks, limiting their ability to scale to large-scale modern pangenomic collections. We introduce RAPID-PFP, a redesigned implementation of the PFP algorithm that takes advantage of the massive parallelism and high memory bandwidth of modern GPUs. RAPID-PFP parallelizes trigger-string detection, phrase parsing, dictionary construction, and parse generation through custom CUDA kernels and GPU-resident data structures built using cuDF, CuPy, and Numba-CUDA. The algorithm operates entirely within GPU memory, minimizes host interaction, and dynamically adapts to available VRAM, enabling efficient processing in a range of hardware configurations. Across E. coli and Human Pangenome (HPRC) datasets, RAPID-PFP produces identical output to established CPU pipelines while delivering an order-of-magnitude acceleration. On 3,682 E. coli assemblies, RAPID-PFP reduces runtime from 552 seconds to 17 seconds compared to PFP-FL (32.1 times) and from 1,078 seconds to 17 seconds compared to PFP-ITL (62.6 times). On the complete 46-sample HPRC dataset, RAPID-PFP achieves a 33.4-fold speedup and successfully processes scales that PFP-ITL cannot handle. Performance improves with dataset size, reflecting that PFP maps naturally onto thousands of CUDA cores, yielding sublinear scaling relative to CPU implementations. RAPID-PFP demonstrates that foundational compressed-indexing algorithms can be re-engineered for accelerators, enabling scalable and practical preprocessing for large-scale genomic indexing workflows.
bioRxiv (Cold Spring Harbor Laboratory) · 2026-04-29
Article · Senior author
Multidrug-resistant and extensively drug-resistant Mycobacterium tuberculosis (MTB) represents a growing global health crisis, characterized by limited treatment options and high mortality rates. Rapid and accurate prediction of resistance profiles is critical to guide effective therapy and curb transmission. Whole-genome sequencing (WGS) offers promise for individualized resistance profiling, yet existing computational tools remain constrained by predefined mutation catalogs and prohibitive resource requirements for large-scale analyses. Here, we present AURA, a GPU-accelerated, pangenome-scale machine learning framework for de novo resistance prediction. Trained on 12,185 globally diverse MTB isolates, AURA predicts resistance to 13 first-line, second-line, and repurposed antibiotics with high precision and identifies 59 novel resistance-associated loci, including variants in katG, pncA, rpoC, and members of the PE/PGRS gene family. By enabling model training on an unprecedented genomic scale, AURA provides new insights into the genetic architecture of resistance and establishes a scalable platform for precision-guided therapy and global surveillance of MTB.
Building genomic data structures from compressed representations using prefix-free parsing
Genome Research · 2026-05-15
Preprint · Senior author
Advances in high-throughput sequencing have lowered the cost and complexity of genome sequencing, making it possible for the first time to assemble large pangenomic data sets for many species. These data sets, comprising thousands of individuals, already span from hundreds of gigabytes to petabytes, far exceeding the memory capacity of most machines, and are expected to continue growing in scale over time. Already, many traditional bioinformatics tools fail on inputs at this scale because they cannot construct their necessary data structures within memory limits. There is a growing need for methods that can construct these structures directly from compressed representations. Prefix-free parsing (PFP) addresses this challenge. PFP serves as a preprocessing step that compresses sufficiently repetitive text, yet still permits building important data structures for the original data set from its compressed output. This survey offers an overview of PFP, covering its core principles, the primary data structures it enables, current applications, and future research directions.
Infection Control and Hospital Epidemiology · 2025-08-22 · 2 citations
Article
Objective: Assess the feasibility and effect of Enhanced Barrier Precautions (EBP) on the transmission of Staphylococcus aureus (SA) and carbapenem-resistant organisms (CRO) among residents in nursing home chronic ventilator units (NH-CVU). Design: Pre-post interventional study. Setting: Two community-based nursing homes with CVUs in Maryland. A total of 56 residents were enrolled in the baseline period and 64 residents were enrolled in the intervention period. Methods: During a 3-month baseline and intervention period, residents were swabbed monthly to estimate SA and CRO acquisition. During a 2-month training period, EBP was implemented for residents with chronic wounds, medical devices, or history of multidrug-resistant organism (MDRO) colonization. During the subsequent 3-month intervention period, healthcare personnel (HCP) wore gowns and gloves for high-contact care activities when residents were on EBP. Whole genome sequencing assessed resident-to-resident transmission. Results: At baseline, NH-CVU1 used gowns and gloves for all direct contact, while NH-CVU2 used EBP only for residents with a history of MDRO colonization. After training, the proportion of NH-CVU2 residents on EBP increased from 65% in the baseline period to 87% in the intervention period. Glove use was high (93–98%) in both NH-CVUs. Gown use increased from 39% to 77% in NH-CVU1 and from 26% to 72% in NH-CVU2. Resident-to-resident transmission of SA or CRO decreased by 25% in NH-CVU1 (p = 0.60) and by 67% in NH-CVU2 (p = 0.05). CRO transmission decreased by 33% in NH-CVU1 (p = 0.54) and by 83% in NH-CVU2 (p = 0.02). Conclusions: EBP is feasible and potentially decreases overall and CRO transmission in nursing home CVUs.
Formal verification of bioinformatics software using model checking and theorem proving
Briefings in Bioinformatics · 2025-07-01
Article · Open access
While there is explosive growth in the creation of biological data, researchers rely on ad hoc verification methods such as testing with small simulated datasets. Due to their importance in biology and biomedicine, there is a critical need to verify these algorithms as well as their implementations to ensure that the results and conclusions are trustworthy. In this paper, we explore an effective combination of model checking and theorem proving of bioinformatics software, including BiopLib, BWA, Jellyfish, SDSL, Dashing, SPAdes, and MUMmer. We provide results for model checking for bioinformatics software libraries and theorem proving for specific properties. Our model checking framework found several potential flaws in two tools (BiopLib and BWA). We have also detected several failing cases in the Succinct Data Structure Library (SDSL).
Robust 16S rRNA classification based on a compressed LCA index
Genome Research · 2025-08-25 · 2 citations
Article · Open access
Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed the document array profiles, which use O(rd) words of space, where r is the number of maximal equal-letter runs in the Burrows–Wheeler transform, and d is the number of distinct genomes. The linear dependence on d is limiting, because real taxonomies can easily contain 10,000s of leaves or more. We propose a method called cliff compression that reduces this size by a large factor, >250× when indexing the SILVA 16S rRNA gene database. This method uses Θ(r log d) words of space in expectation under a random model we propose here. We implemented these ideas in an open-source tool called Cliffy that performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11%–18%. Clade abundances are also more accurately predicted by Cliffy compared with Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries. Cliffy's accuracy underscores the advantages of full-text indexes, which offer a more precise solution compared with k-mer indexes designed for a specific k value.
SDSL-Mobile: Enabling space-efficient data structures for mobile applications
SoftwareX · 2025-06-24
Article · Open access · Senior author · Corresponding
This paper presents the process and results of porting the Succinct Data Structure Library 2.0 (SDSL-lite), a robust and well-established open-source C++11 library, to Android platforms. The resulting library, called SDSL-Mobile, implements space-efficient data structures, including wavelet trees, compressed suffix arrays, and bit vectors, which are essential for handling large datasets in domains such as bioinformatics and information retrieval. Although originally designed for desktop environments, the library is extended to Android using the Android Native Development Kit (NDK) to enable integration into mobile platforms. Functionality is evaluated by implementing wavelet forests within an Android application, and performance is compared against a desktop implementation. The results demonstrate the feasibility of deploying succinct data structures on mobile devices, highlighting new possibilities for advanced data processing in resource-constrained environments.
Accurate short-read alignment through r-index-based pangenome indexing
Genome Research · 2025-06-12 · 2 citations
Article · Open access · Senior author
Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r) space, where r is the number of runs in the Burrows–Wheeler transform. Moni-align uses a seed-and-extend strategy for aligning reads, utilizing maximal exact matches as seeds, which can be efficiently obtained with the r-index. Using both simulated and real short-read data sets, we demonstrate that Moni-align achieves alignment accuracy comparable to vg map and vg giraffe, the leading pangenome aligners. Although currently best suited for aligning to localized pangenomes owing to computational constraints, Moni-align offers a robust foundation for future optimizations that could further broaden its applicability.
SSRN Electronic Journal · 2025-01-01
Preprint · Open access · Senior author
Recent grants
Developing Computational Methods for Surveillance of Antimicrobial Resistant Agents
NIH · $1.7M · 2018–2025
Collaborative Research: EAGER: Solving the bait learning problem for large-scale DNA enrichment
NSF · $159k · 2021–2024
III: Small: Collaborative Research: A Scalable and Efficient Optical Map Assembler
NSF · $397k · 2016–2021
SCH: INT: Enabling real time surveillance of antimicrobial resistance
NSF · $1.2M · 2021–2026
Developing Computational Methods for Surveillance of Antimicrobial Resistant Agents
NIH · $450k · 2018–2023
Frequent coauthors
- Travis Gagie · Dalhousie University · 41 shared
- Mattia Prosperi · University of Florida · 29 shared
- Noelle Noyes · University of Minnesota · 25 shared
- Massimiliano Rossi · Illumina (United States) · 24 shared
- Giovanni Manzini · 19 shared
- Marco Antônio Oliva · University of Florida · 19 shared
- Alan Kuhnle · 16 shared
- Ben Langmead · Johns Hopkins University · 15 shared
Awards & honors
- UF Term Professorship, 2021