
Arjun Guha
Verified · Northeastern University · Software Engineering
Active 2005–2026
About
Arjun Guha is an associate professor in the Khoury College of Computer Sciences at Northeastern University, based in Boston. His research focuses on programming languages, with particular interest in security and reliability problems in web programming, systems, and robotics. Guha uses tools and techniques from programming languages to address these issues, and one of his recent projects aims to make serverless computing more cost-effective, reliable, and applicable. He is a member of the Programming Research Laboratory. Prior to joining Northeastern, Guha was an associate professor at the University of Massachusetts Amherst and a postdoctoral research associate at Cornell University. His work has received several awards, including an OOPSLA Most Influential Paper Award, a PLDI Distinguished Paper Award, and a PACT Best Paper Award. In his free time, Guha enjoys running, cooking, and reading.
Research topics
- Artificial Intelligence
- Computer Science
- World Wide Web
- Machine Learning
- Operating system
- Programming language
- Software engineering
- Theoretical computer science
Selected publications
Learning Reasoning World Models for Parallel Code
arXiv (Cornell University) · 2026-04-22
preprint · Open access
Large language models have shown remarkable ability in serial code generation, but they still struggle with parallel code for which training data is comparatively scarce. A common remedy is to use coding agents that interact with external tools, but tool calls can be costly and sometimes impractical, e.g., for partially written code. We propose Parallel-Code World Models (PCWMs), reasoning LLMs that aim to predict tool outcomes directly from parallel source code. To train PCWMs, we design a novel exploration and data generation pipeline that samples diverse parallel-coding problems and candidate implementations across multiple domains, then executes them via tools to record data races and performance profiles. From these, we synthesize hindsight reasoning traces that causally connect source code to observed tool outcomes. Fine-tuning on the resulting data yields noticeable gains, with a 7B-parameter world model improving from 64.3% to 72.8% accuracy in race-outcome prediction, while an 8B-parameter model improves in a performance profiling task from 49.3% to 58.6% accuracy. Furthermore, when open-weight models were tasked with fixing data races, world-model feedback improved their race-fixing rates relative to self-feedback by 2.7%-9.1% using our 7B-parameter world model and by 6.1%-11.1% using our 14B-parameter world model. Our results suggest that reasoning models have the potential to serve as practical substitutes for external tool calls in parallel-coding agents.
Steering Code LLMs with Activation Directions for Language and Library Control
arXiv (Cornell University) · 2026-03-24
preprint · Open access
Code LLMs often default to particular programming languages and libraries under neutral prompts. We investigate whether these preferences are encoded as approximately linear directions in activation space that can be manipulated at inference time. Using a difference-in-means method, we estimate layer-wise steering vectors for five language/library pairs and add them to model hidden states during generation. Across three open-weight code LLMs, these interventions substantially increase generation toward the target ecosystem under neutral prompts and often remain effective even when prompts explicitly request the opposite choice. Steering strength varies by model and target, with common ecosystems easier to induce than rarer alternatives, and overly strong interventions can reduce output quality. Overall, our results suggest that code-style preferences in LLMs are partly represented by compact, steerable structure in activation space.
Understanding How CodeLLMs (Mis)Predict Types with Activation Steering
2025-01-01
article · Open access · Senior author
ReasoningWeekly: A General Knowledge and Verbal Reasoning Challenge for Large Language Models
2025-01-01
article · Open access · Senior author
Zixuan Wu, Francesca Lucchetti, Aleksander Boruch-Gruszecki, Jingmiao Zhao, Carolyn Jane Anderson, Joydeep Biswas, Federico Cassano, Arjun Guha. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. 2025.
ReasoningWeekly: A General Knowledge and Verbal Reasoning Challenge for Large Language Models
ArXiv.org · 2025-02-03 · 1 citation
preprint · Open access · Senior author
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark with 613 problems based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. As LLMs are more widely deployed in society, we believe it is useful to develop benchmarks for frontier models that humans can understand without the need for deep domain expertise. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models on our benchmark, despite being on par with other models when tested on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and in rare cases, it does not "finish thinking," which suggests the need for techniques to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
Bridging the Gap Between Binary and Source Based Package Management in Spack
2025-11-12 · 1 citation
article · Open access
Binary package managers install software quickly but they limit configurability due to rigid ABI requirements that ensure compatibility between binaries. Source package managers provide flexibility in building software, but compilation can be slow. For example, installing an HPC code with a new MPI implementation may result in a full rebuild. Spack, a widely deployed, HPC-focused package manager, can use source and pre-compiled binaries, but lacks a binary compatibility model, so it cannot mix binaries not built together. We present splicing, an extension to Spack that models binary compatibility between packages and allows seamless mixing of source and binary distributions. Splicing augments Spack’s packaging language and dependency resolution engine to reuse compatible binaries but maintains the flexibility of source builds. It incurs minimal installation-time overhead and allows rapid installation from binaries, even for ABI-sensitive dependencies like MPI that would otherwise require many rebuilds.
Substance Beats Style: Why Beginning Students Fail to Code with LLMs
2025-01-01 · 2 citations
article · Open access
Francesca Lucchetti, Zixuan Wu, Arjun Guha, Molly Q Feldman, Carolyn Jane Anderson. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025.
Leveraging AI for Productive and Trustworthy HPC Software: Challenges and Research Directions
Lecture notes in computer science · 2025-11-23 · 1 citation
book-chapter · Open access
Recent grants
SHF:Small:A Language-based Approach to Faster and Safer Serverless Computing
NSF · $457k · 2020–2024
NeTS: Large: Collaborative Research:Programmable Inter-Domain Observation and Control
NSF · $692k · 2014–2019
Collaborative Research: FMitF: Track I: Game Theoretic Updates for Network and Cloud Functions
NSF · $295k · 2020–2021
Collaborative Research: FMitF: Track I: Game Theoretic Updates for Network and Cloud Functions
NSF · $295k · 2020–2024
Collaborative Research: SHF: Small: Interactive Synthesis and Repair For Robot Programs
NSF · $194k · 2020–2024
Frequent coauthors
- 55 shared
Shriram Krishnamurthi
- 20 shared
Joe Gibbs Politz
University of California, San Diego
- 19 shared
Joydeep Biswas
The University of Texas at Austin
- 17 shared
Abhinav Jangda
Microsoft (United States)
- 17 shared
Carolyn Jane Anderson
- 15 shared
Federico Cassano
Northeastern University
- 13 shared
Nate Foster
Cornell University
- 12 shared
Donald Pinckney
Northeastern University
Labs
Khoury College of Computer Sciences · PI
Education
- 2007
Ph.D., Computer Science
University of California, Los Angeles
- 2003
M.S., Computer Science
University of California, Los Angeles
- 2001
B.S., Computer Science
University of California, Los Angeles
Awards & honors
- OOPSLA Most Influential Paper Award
- PLDI Distinguished Paper Award
- PACT Best Paper Award
- Distinguished Paper Award (2019)