Najoung Kim

· Assistant Professor

New York University · Center for Data Science

Active 2016–2025

h-index: 18

Citations: 1.9k

Papers: 67 (47 in the last 5 years)

Funding: not listed


OpenAlex


About

Najoung Kim is an Assistant Professor at Boston University and a former faculty fellow at the NYU Center for Data Science. Her research sits at the intersection of natural language processing and linguistics. The publications listed here cover compositionality in adjective-noun inference, entity tracking in language models, interpretability of neural networks, and the evaluation of LLM-based systems.

Research topics

Artificial Intelligence
Computer Science
Natural Language Processing
Programming language
Data Mining
Linguistics
Philosophy
Physics
Mathematics

Selected publications

Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It
ArXiv.org · 2025-07-17
preprint · Open access · Senior author
Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.
RExBench: Can coding agents autonomously implement AI research extensions?
arXiv (Cornell University) · 2025-06-27
preprint · Open access · Senior author
Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of realistic extensions of 12 research papers that aim to investigate novel research hypotheses. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate 12 LLM agents implemented using two different frameworks, aider and OpenHands. We find that all agents fail to autonomously implement the majority of the extensions, with the best agent achieving around a 33% success rate. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 44%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models
2025-01-01
article · Open access
Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek. Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2025.
Front Matter
2025-01-01
article · Open access
Yonatan Belinkov, Aaron Mueller, Najoung Kim, Hosein Mohebbi, Hanjie Chen, Dana Arad, Gabriele Sarti. Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2025.
Fake reefs are sometimes reefs and sometimes not, but are always compositional
Experiments in Linguistic Meaning · 2025-01-24
article · Open access
The semantics of adjective modification often begins with set intersection, such that [[yellow flower]] = [[yellow]] ∩ [[flower]]. Thus a yellow flower is a flower. Such an account, however, runs into problems for adjectives like fake or counterfeit, which display a privative inference: a fake gun is not a gun and a counterfeit dollar is not a dollar. Moreover, recent work shows privativity cannot easily be encoded as a property of specific adjectives like counterfeit, since e.g. counterfeit watch robustly licenses the subsective inference of being a watch (Martin 2022). We gather judgments on nearly 800 adjective-noun bigrams (of which 180 are novel, i.e. zero corpus frequency), and show that privativity depends on the adjective, noun and context, and can be manipulated for the very same adjective-noun bigram by presenting it in different contexts. This poses a challenge for theories which fix privativity as a property of the adjective and always use the same method of composition (Partee 2010, del Pinal 2015). Moreover, we find no difference in participant behavior between novel adjective-noun bigrams and high frequency ones, suggesting that the process is nonetheless compositional and not the result of convention or memorized idiosyncrasy. Our results support compositional accounts like Martin (2022) (which modifies del Pinal 2015) and Guerrini (2024), which treat privativity as context-dependent.
Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
ArXiv.org · 2025-01-16
preprint · Open access
Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To address this, we introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. EraseBench includes over 100 curated concepts, targeted evaluation prompts, and a robust set of metrics to assess both effectiveness and side effects of erasure. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
2025-01-01 · 2 citations
article · Open access · Senior author
Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable because it decomposes evaluation criteria into traceable binary decisions, allowing analyses of specific attributes driving quality judgments.
Is analogy enough to draw novel adjective-noun inferences?
ArXiv.org · 2025-03-31
preprint · Open access · Senior author
Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase's meaning and derive inferences. We study whether these inferences can instead be derived by analogy to known inferences, without need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition.
Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
2025-10-19 · 1 citation
article
Code Pretraining Improves Entity Tracking Abilities of Language Models
arXiv (Cornell University) · 2024-05-31 · 1 citation
preprint · Open access · 1st author · Corresponding
Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim by comparing pairs of language models on their entity tracking performance. Critically, the pairs consist of base models and models trained on top of these base models with additional code data. We extend this analysis to additionally examine the effect of math training, another highly structured data type, and alignment tuning, an important step for enhancing the usability of models. We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.

Frequent coauthors

Ellie Pavlick
43 shared
Tal Linzen
39 shared
Samuel R. Bowman
36 shared
Benjamin Van Durme
35 shared
Patrick Xia
32 shared
Ian Tenney
32 shared
Alexis Ross
29 shared
Roma Patel
29 shared

Awards & honors

Faculty Fellow Alumni Outcomes - NYU Center for Data Science
Moore-Sloan Fellows at CDS
DIRAC Fellow
