Najoung Kim · Assistant Professor

New York University · Center for Data Science

Active 2016–2025

h-index: 18
Citations: 1.9k
Papers: 67 (47 in the last 5 years)

About

Najoung Kim is an Assistant Professor at Boston University and a faculty fellow at the NYU Center for Data Science. Her research focuses on advancing the understanding and development of artificial intelligence and data science, with an emphasis on interdisciplinary applications. She has contributed to the field through original research that fosters collaborations across disciplines, working towards innovative solutions in AI and data science.

Research topics

  • Artificial Intelligence
  • Computer Science
  • Natural Language Processing
  • Programming language
  • Data Mining
  • Linguistics
  • Philosophy
  • Physics
  • Mathematics

Selected publications

  • Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It

    ArXiv.org · 2025-07-17

    preprint · Open access · Senior author

    Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.

  • RExBench: Can coding agents autonomously implement AI research extensions?

    arXiv (Cornell University) · 2025-06-27

    preprint · Open access · Senior author

    Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of realistic extensions of 12 research papers that aim to investigate novel research hypotheses. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate 12 LLM agents implemented using two different frameworks, aider and OpenHands. We find that all agents fail to autonomously implement the majority of the extensions, with the best agent achieving around a 33% success rate. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 44%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

  • Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

    2025-01-01

    article · Open access

    Dana Arad, Yonatan Belinkov, Hanjie Chen, Najoung Kim, Hosein Mohebbi, Aaron Mueller, Gabriele Sarti, Martin Tutek. Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2025.

  • Front Matter

    2025-01-01

    article · Open access

    Yonatan Belinkov, Aaron Mueller, Najoung Kim, Hosein Mohebbi, Hanjie Chen, Dana Arad, Gabriele Sarti. Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2025.

  • Fake reefs are sometimes reefs and sometimes not, but are always compositional

    Experiments in Linguistic Meaning · 2025-01-24

    article · Open access

    The semantics of adjective modification often begins with set intersection, such that [[yellow flower]] = [[yellow]] ∩ [[flower]]. Thus a yellow flower is a flower. Such an account, however, runs into problems for adjectives like fake or counterfeit, which display a privative inference: a fake gun is not a gun and a counterfeit dollar is not a dollar. Moreover, recent work shows privativity cannot easily be encoded as a property of specific adjectives like counterfeit, since e.g. counterfeit watch robustly licenses the subsective inference of being a watch (Martin 2022). We gather judgments on nearly 800 adjective-noun bigrams (of which 180 are novel, i.e. zero corpus frequency), and show that privativity depends on the adjective, noun and context, and can be manipulated for the very same adjective-noun bigram by presenting it in different contexts. This poses a challenge for theories which fix privativity as a property of the adjective and always use the same method of composition (Partee 2010, del Pinal 2015). Moreover, we find no difference in participant behavior between novel adjective-noun bigrams and high frequency ones, suggesting that the process is nonetheless compositional and not the result of convention or memorized idiosyncrasy. Our results support compositional accounts like Martin (2022) (which modifies del Pinal 2015) and Guerrini (2024), which treat privativity as context-dependent.

  • Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts

    ArXiv.org · 2025-01-16

    preprint · Open access

    Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To address this, we introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. EraseBench includes over 100 curated concepts, targeted evaluation prompts, and a robust set of metrics to assess both effectiveness and side effects of erasure. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.

  • CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists

    2025-01-01 · 2 citations

    article · Open access · Senior author

    Existing LLM-as-a-Judge approaches for evaluating text generation suffer from rating inconsistencies, with low agreement and high rating variance across different evaluator models. We attribute this to subjective evaluation criteria combined with Likert scale scoring in existing protocols. To address this issue, we introduce CheckEval, a checklist-based evaluation framework that improves rating reliability via decomposed binary questions. Through experiments with 12 evaluator models across multiple datasets, we first demonstrate that CheckEval strongly correlates with human judgments. More importantly, CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance. CheckEval scores furthermore have the benefit of being more interpretable because it decomposes evaluation criteria into traceable binary decisions, allowing analyses of specific attributes driving quality judgments.

  • Is analogy enough to draw novel adjective-noun inferences?

    ArXiv.org · 2025-03-31

    preprint · Open access · Senior author

    Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase's meaning and derive inferences. We study whether these inferences can instead be derived by analogy to known inferences, without need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition.

  • Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts

    2025-10-19 · 1 citation

    article

  • Code Pretraining Improves Entity Tracking Abilities of Language Models

    arXiv (Cornell University) · 2024-05-31 · 1 citation

    preprint · Open access · 1st author · Corresponding

    Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim by comparing pairs of language models on their entity tracking performance. Critically, the pairs consist of base models and models trained on top of these base models with additional code data. We extend this analysis to additionally examine the effect of math training, another highly structured data type, and alignment tuning, an important step for enhancing the usability of models. We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.

Awards & honors

  • Faculty Fellow Alumni Outcomes - NYU Center for Data Science
  • Moore-Sloan Fellows at CDS
  • DIRAC Fellow