Koustuv Saha
Assistant Professor · University of Illinois Urbana-Champaign · Computer Science
Active 2017–2026
About
Koustuv Saha is an Assistant Professor of Computer Science at the University of Illinois at Urbana-Champaign, a position he has held since Fall 2023. He completed his Ph.D. in Computer Science at Georgia Tech in 2021 and holds a B.Tech. (Hons.) in Computer Science and Engineering from the Indian Institute of Technology Kharagpur. His research interests include social computing, computational social science, human-centered machine learning, and FATE (Fairness, Accountability, Transparency, and Ethics in AI). He applies machine learning, natural language processing, and causal inference to examine human behavior and wellbeing using digital data such as social media and multimodal sensing data. His work questions the underlying assumptions of data-driven inferences and explores the possible harms associated with them. His research is interdisciplinary and human-centered, with implications for various stakeholders. Saha has published his work at venues including CHI, CSCW, ICWSM, IMWUT (UbiComp), Scientific Reports, JMIR, and FAccT. He has received several awards, including the 2021 Outstanding Doctoral Dissertation Award from Georgia Tech's College of Computing, the Foley Scholarship Award from the GVU Center, and the Snap Research Fellowship. His research has been recognized with the Outstanding Study Design Award at ICWSM and has been covered by media outlets including the New York Times, Vox, CBC Radio, and NBC. Before joining the faculty, he was a Senior Researcher at Microsoft Research in Montreal, working in the FATE group, and completed research internships at Snap Research, Microsoft Research, Max Planck Institute, and Fred Hutch Cancer Research. He has six years of industry research experience and was awarded the NTSE Scholarship by the Government of India.
Research topics
- Computer Science
- Artificial Intelligence
- Psychology
- Virology
- Medicine
- Cognitive science
- Knowledge management
- Geography
- Psychiatry
- Human–computer interaction
- Management
- World Wide Web
Selected publications
2026-04-13 · 2 citations
article · Open access
Limited English proficiency (LEP) patients in the U.S. face systemic barriers to healthcare beyond language and interpreter access, encompassing procedural and institutional constraints. AI advances may support communication and care through on-demand translation and visit preparation, but also risk exacerbating existing inequalities. We conducted storyboard-driven interviews with 14 patient navigators to explore how AI could shape care experiences for Spanish-speaking LEP individuals. We identified tensions around linguistic and cultural misunderstandings, privacy concerns, and opportunities and risks for AI to augment care workflows. Participants highlighted structural factors that can undermine trust in AI systems, including sensitive information disclosure, unstable technology access, and low literacy. While AI tools can potentially alleviate social barriers and institutional constraints, there are risks of misinformation and reducing human-to-human interactions. Our findings contribute AI design considerations that support LEP patients and care teams via rapport-building, educational and language support, and minimizing disruptions to existing practices.
ACM Transactions on Computing for Healthcare · 2026-03-24 · 1 citation
article · Open access · Senior author
Family caregivers of individuals with Alzheimer’s Disease and Related Dementia (AD/ADRD) face significant emotional and logistical challenges that place them at heightened risk for stress, anxiety, and depression. Although recent advances in generative AI, particularly large language models (LLMs), offer new opportunities to support mental health, little is known about how caregivers perceive and engage with such technologies. To address this gap, we developed Carey, a GPT-4o–based chatbot designed to provide informational and emotional support to AD/ADRD caregivers. Using Carey as a technology probe, we conducted semi-structured interviews with 16 family caregivers following scenario-driven interactions grounded in common caregiving stressors. Through inductive coding and reflexive thematic analysis, we surface a systemic understanding of caregiver needs and expectations across six themes: on-demand information access, safe space for disclosure, emotional support, crisis management, personalization, and data privacy. For each of these themes, we also identified the nuanced tensions in the caregivers’ desires and concerns. We present a mapping of caregiver needs, AI chatbots’ strengths, gaps, and design recommendations. Our findings offer theoretical and practical insights to inform the design of proactive, trustworthy, and caregiver-centered AI systems that better support the evolving mental health needs of AD/ADRD caregivers.
Social Simulacra in the Wild: AI Agent Communities on Moltbook
ArXiv.org · 2026-03-17
article · Open access · Senior author
As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.
A Social Media Lens on the Needs and Concerns of Information Workers
ACM Transactions on Social Computing · 2026-04-21
article
Technological advancements have greatly impacted labor market dynamics, leaving a psychological impact on workers. Although some studies have explored such labor market changes and their effects on workers, they are limited to self-reported data, such as surveys and questionnaires. In this paper, we propose a new approach for identifying information workers’ challenges and their impact on workers’ emotional well-being using large-scale, inexpensive, and near-real-time online social network data. While the research is situated within the broader Future of Work context shaped by technological change, the data collected and analyzed focus specifically on IT-related skills and topics discussed on Reddit, rather than on artificial intelligence itself. Analyzing over 700,000 Reddit posts related to IT occupations, we identify major labor market topics, including education and skill development, job search, and employment concerns, and examine how workers’ emotional expressions vary across gender and age groups. Our findings reveal systematic differences in how workers from different demographic groups disclose their needs and emotional states online. This work demonstrates the value of social media data as a complementary lens for understanding workers’ challenges in the knowledge economy. It highlights the potential of online communities as vehicles for peer support and well-being interventions.
2026-04-13
article · Senior author
Agentic AI framework for End-to-End Medical Data Inference
ArXiv.org · 2025-07-24
preprint · Open access · Senior author
Building and deploying machine learning solutions in healthcare remains expensive and labor-intensive due to fragmented preprocessing workflows, model compatibility issues, and stringent data privacy constraints. In this work, we introduce an Agentic AI framework that automates the entire clinical data pipeline, from ingestion to inference, through a system of modular, task-specific agents. These agents handle both structured and unstructured data, enabling automatic feature selection, model selection, and preprocessing recommendation without manual intervention. We evaluate the system on publicly available datasets from geriatrics, palliative care, and colonoscopy imaging. For example, in the case of structured data (anxiety data) and unstructured data (colonoscopy polyps data), the pipeline begins with file-type detection by the Ingestion Identifier Agent, followed by the Data Anonymizer Agent ensuring privacy compliance, where we first identify the data type and then anonymize it. The Feature Extraction Agent identifies features using an embedding-based approach for tabular data, extracting all column names, and a multi-stage MedGemma-based approach for image data, which infers modality and disease name. These features guide the Model-Data Feature Matcher Agent in selecting the best-fit model from a curated repository. The Preprocessing Recommender Agent and Preprocessing Implementor Agent then apply tailored preprocessing based on data type and model requirements. Finally, the Model Inference Agent runs the selected model on the uploaded data and generates interpretable outputs using tools like SHAP, LIME, and DETR attention maps. By automating these high-friction stages of the ML lifecycle, the proposed framework reduces the need for repeated expert intervention, offering a scalable, cost-efficient pathway for operationalizing AI in clinical environments.
Mental Wellbeing Effects of Disclosing Life Events on Social Media
Research Square · 2025-06-19
preprint · Open access · 1st author · Corresponding
2025-10-21
article · Open access · Senior author
Background: Suicide is a critical global public health issue, with millions experiencing Suicidal Ideation (SI) each year. Online platforms, such as Reddit, provide spaces where individuals express suicidal thoughts and seek peer support. While prior computational research has leveraged machine learning and natural language analysis to detect SI, much of it lacks grounding in psychological theory, limiting interpretability and intervention design. Objective: This study applies the Interpersonal Theory of Suicide (IPTS) to understand the underlying psychosocial mechanisms driving high-risk suicidal intent in online spaces, analyze linguistic expressions of SI, and assess the role of AI systems in providing supportive responses. Methods: We analyzed 59,607 posts from Reddit’s r/SuicideWatch community. Posts were categorized into four SI dimensions: Loneliness, Lack of Reciprocal Love, Self-Hate, and Liability; and three IPTS-based risk factors: Thwarted Belongingness, Perceived Burdensomeness, and Acquired Capability for Suicide. High-risk posts were identified based on language markers of planning, attempts, and intent. We further conducted psycholinguistic and content analyses of supportive responses and evaluated AI chatbot-generated replies for structural coherence and empathy. Results: High-risk SI posts contained frequent references to planning and attempts (21.3%), methods and tools (18.6%), and expressions of weakness and pain (24.9%). Supportive peer responses varied significantly across SI stages (P < .001), with deeper empathy and self-disclosure emerging in replies to high-risk posts. AI chatbot responses demonstrated improved structural coherence (Cohen’s κ = 0.74) but were rated significantly lower on personalization and emotional depth (P < .001) by expert evaluators. Conclusions: Grounding computational analysis in IPTS provides richer theoretical insight into SI expressed online. While AI-based systems can enhance the structural and linguistic quality of supportive messages, they currently lack the nuanced empathy and contextual awareness needed for effective mental health support. These findings highlight the need for theory-driven, human-AI collaborative frameworks in suicide prevention research and interventions.
Analyzing Conspiratorial Content Across Singapore-Based Telegram Groups
medRxiv · 2025-07-17
preprint · Open access
Telegram has emerged as a key platform for the circulation of conspiratorial narratives. We examine conspiratorial discourse within Singapore-based Telegram groups from 2021 to 2025. We analyze over 10 million words from three Telegram groups. We developed a logistic regression classifier to detect conspiratorial content, achieving an F1 score of 0.74 and expert-validated labeling accuracy of 72%. Topic models indicated dominant themes centered around elite control, vaccine risks, and globalist agendas. While most users rarely posted conspiratorial content, a small, highly active minority accounted for most of such messages. These users frequently forwarded messages across multiple groups, amplifying the spread of content with short but intense lifecycles (mean lifespan = 6.8 days). Network analysis showed that users typically joined multiple groups in rapid succession and that conspiratorial messages traveled across groups within weeks. We underscore the importance of user-centric monitoring, time-sensitive interventions, and platform-specific models for content detection.
Frequent coauthors
- Munmun De Choudhury · 147 shared
- S. Mo Jang · 67 shared
- Orestis Papakyriakopoulos · 65 shared
- Kaveh Khoshnood (Yale University) · 62 shared
- Joseph D. Tucker · 61 shared
- Weiming Tang (Shenzhen University) · 60 shared
- Kamila Janmohamed (Yale University) · 60 shared
- Chris T. Bauch (University of Waterloo) · 60 shared
Education
- 2021 · Ph.D., Computer Science · Georgia Institute of Technology
- B.Tech. (Hons.), Computer Science and Engineering · Indian Institute of Technology Kharagpur
Awards & honors
- Adamic-Glance Early Career Distinguished Award, AAAI Interna…
- OpEd/Public Voices Fellow, 2025
- Best Paper Honorable Mention, CHI, 2024
- Georgia Tech College of Computing Outstanding Dissertation A…
- Best Paper Honorable Mention, WristSense, 2020