AnHai Doan

Vilas Distinguished Achievement Professor; Gurindar S. Sohi Professor

University of Wisconsin-Madison · Computer Sciences

Active 1994–2025

h-index: 56
Citations: 13.5k
Papers: 187 (16 in the last 5 years)
Funding: $1.3M

About

AnHai Doan is the Vilas Distinguished Achievement Professor and Gurindar S. Sohi Professor of Computer Science at the University of Wisconsin–Madison. His research goal is to make messy data usable at scale. He works on data integration, data science, and machine learning, building end-to-end systems that are deployed in real-world settings. He has received the ACM Doctoral Dissertation Award, NSF CAREER Award, and Sloan Fellowship, and co-authored Principles of Data Integration, a widely used textbook. AnHai has worked extensively at the intersection of academia and industry, serving on the advisory board of Transformic (acquired by Google), as Chief Scientist at Kosmix (acquired by Walmart), and co-founding GreenBay Technologies (acquired by Informatica). He has also served on the SIGMOD Advisory and Executive Committees and was Co-Chair of SIGMOD 2020. His background includes growing up in Vietnam, studying in Hungary, and earning his Ph.D. from the University of Washington in 2002. His career spans roles as a graduate student, professor, startup employee, and big-company employee, with interests outside of work in architecture, history, art, interior design, traveling, and long-distance hiking.

Research topics

  • Computer Science
  • Data Mining
  • Artificial Intelligence
  • World Wide Web
  • Data science
  • Database
  • Theoretical computer science
  • Algorithm
  • Mathematics

Selected publications

  • Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models

    2025-01-01

    article · Open access · Senior author

    Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for many downstream NLP tasks for tabular data, such as NL2SQL, table QA, and keyword search. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper, we make three contributions that significantly advance the state of the art. First, we show that the synthetic public data used by prior work has major limitations, and we introduce four new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over five datasets. Columbo has been used in production on EDI, a major data lake for environmental sciences.

  • Technical Perspective: Unicorn: A Unified Multi-Tasking Matching Model

    ACM SIGMOD Record · 2024-05-14

    article · 1st author · Corresponding

    Data integration has been a long-standing challenge for data management. It has recently received significant attention due to at least three main reasons. First, many data science projects require integrating data from disparate sources before analysis can be carried out to extract insights. Second, many organizations want to build knowledge graphs, such as Customer 360s, Product 360s, and Supplier 360s, which capture all available information about the customers, products, and suppliers of an organization. Building such knowledge graphs often requires integrating data from multiple sources. Finally, there is also an increasing need to integrate a massive amount of data to create training data for AI models, such as large language models.

  • Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

    Proceedings of the VLDB Endowment · 2023-02-01 · 29 citations

    article · Senior author

    Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.

  • Effective entity matching with transformers

    The VLDB Journal · 2023-01-17 · 28 citations

    article

  • Cloud data systems

    Proceedings of the VLDB Endowment · 2022-08-01

    article

    The panel will discuss the research opportunities for the database research community in the context of cloud native data services.

  • The Seattle report on database research

    Communications of the ACM · 2022 · 43 citations

    • Computer Science
    • Database

    Every five years, a group of the leading database researchers meet to reflect on their community's impact on the computing industry as well as examine current research challenges.

  • Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization

    2022 IEEE International Conference on Big Data (Big Data) · 2022-12-17 · 2 citations

    article · Senior author

    Many applications need to clean data with a target accuracy, e.g., with at least 95% precision. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings then asking a user to verify and clean the clusters, until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. So the users often take ad-hoc, suboptimal, or incorrect actions. Verifying and cleaning also often take a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.

  • DEEP LEARNING FOR SEMANTIC MATCHING: A SURVEY

    Journal of Computer Science and Cybernetics · 2021-10-12 · 3 citations

    article · Open access · Senior author

    Semantic matching finds certain types of semantic relationships among schema/data constructs. Examples include entity matching, entity linking, coreference resolution, schema/ontology matching, semantic text similarity, textual entailment, question answering, tagging, etc. Semantic matching has received much attention in the database, AI, KDD, Web, and Semantic Web communities. Recently, many works have also applied deep learning (DL) to semantic matching. In this paper we survey this fast-growing topic. We define the semantic matching problem, categorize its variations into a taxonomy, and describe important applications. We describe DL solutions for important variations of semantic matching. Finally, we discuss future R&D directions.

  • Deep learning for blocking in entity matching

    Proceedings of the VLDB Endowment · 2021 · 77 citations

    Senior author · Corresponding
    • Computer Science
    • Artificial Intelligence

    Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking then matching. Many works have applied deep learning (DL) to matching, but far fewer works have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL and some of them require labeled training data. In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformer, self supervision). We empirically determine which solutions perform best on what kind of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution), on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new venue for research.

  • Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization

    arXiv (Cornell University) · 2021-01-13

    preprint · Open access · Senior author

    Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings then asking a user to verify and clean the clusters, until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
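Several of the abstracts above concern blocking for entity matching, and the Sparkly abstract in particular describes top-k tf/idf blocking. As a rough illustration of that idea only, here is a minimal single-machine sketch in plain Python. It is not Sparkly's implementation: it uses neither Lucene nor Spark, and the function names (`tokenize`, `build_index`, `top_k_block`), the whitespace tokenizer, and the smoothed idf formula are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokens; a stand-in for a real tokenizer (assumption).
    return text.lower().split()


def normalize(vec):
    # Scale a sparse vector to unit length; all-zero vectors are returned unchanged.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(records):
    # Build unit-length tf/idf vectors for the indexed table, plus the idf table.
    docs = [Counter(tokenize(r)) for r in records]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    idf = {tok: math.log((n + 1) / (c + 1)) + 1.0 for tok, c in df.items()}
    vecs = [normalize({t: tf * idf[t] for t, tf in d.items()}) for d in docs]
    return vecs, idf


def top_k_block(left, right, k=2):
    # For each left record, keep the k right records with the highest
    # positive tf/idf cosine score as candidate pairs for a later matcher.
    right_vecs, idf = build_index(right)
    pairs = []
    for i, rec in enumerate(left):
        counts = Counter(tokenize(rec))
        query = normalize({t: tf * idf.get(t, 0.0) for t, tf in counts.items()})
        scores = [
            (sum(w * rv.get(t, 0.0) for t, w in query.items()), j)
            for j, rv in enumerate(right_vecs)
        ]
        best = sorted((s for s in scores if s[0] > 0), key=lambda s: (-s[0], s[1]))[:k]
        pairs.extend((i, j) for _, j in best)
    return pairs


left = ["apple iphone 13 pro"]
right = ["iphone 13 pro apple 128gb", "samsung galaxy s21", "apple macbook air"]
print(top_k_block(left, right, k=2))
```

Keeping only the top k candidates per record bounds the output size regardless of how common a token is, which is the property the abstract highlights when it recommends top-k blocking.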

Frequent coauthors

  • Alon Halevy

    49 shared
  • Zachary G. Ives

    University of Pennsylvania

    22 shared
  • Raghu Ramakrishnan

    20 shared
  • Jeffrey F. Naughton

    19 shared
  • Peter Haddawy

    Mahidol University

    17 shared
  • Warren Shen

    13 shared
  • Sanjib Das

    Jadavpur University

    13 shared
  • Pedro Domingos

    University of Washington

    13 shared

Awards & honors

  • Gurindar S. Sohi Professorship, 2020
  • Vilas Distinguished Achievement Professorship, 2018
  • Alfred Sloan Research Fellowship, 2007
  • NSF CAREER Award, 2004
  • ACM Doctoral Dissertation Award, 2003