AnHai Doan

Vilas Distinguished Achievement Professor; Gurindar S. Sohi Professor

University of Wisconsin-Madison · Computer Sciences

Active 1994–2025

h-index 56

Citations 13.5k

Papers 187 · 16 last 5y

Funding $1.3M



About

AnHai Doan is the Vilas Distinguished Achievement Professor and Gurindar S. Sohi Professor of Computer Science at the University of Wisconsin–Madison. His research goal is to make messy data usable at scale. He works on data integration, data science, and machine learning, building end-to-end systems that are deployed in real-world settings. He has received the ACM Doctoral Dissertation Award, NSF CAREER Award, and Sloan Fellowship, and co-authored Principles of Data Integration, a widely used textbook. AnHai has worked extensively at the intersection of academia and industry, serving on the advisory board of Transformic (acquired by Google), as Chief Scientist at Kosmix (acquired by Walmart), and co-founding GreenBay Technologies (acquired by Informatica). He has also served on the SIGMOD Advisory and Executive Committees and was Co-Chair of SIGMOD 2020. His background includes growing up in Vietnam, studying in Hungary, and earning his Ph.D. from the University of Washington in 2002. His career spans roles as a graduate student, professor, startup employee, and big-company employee, with interests outside of work in architecture, history, art, interior design, traveling, and long-distance hiking.

Research topics

Computer Science
Data Mining
Artificial Intelligence
World Wide Web
Data science
Database
Theoretical computer science
Algorithm
Mathematics

Selected publications

Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models
2025-01-01
article · Open access · Senior author
Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for many downstream NLP tasks for tabular data, such as NL2SQL, table QA, and keyword search. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper, we make three contributions that significantly advance the state of the art. First, we show that the synthetic public data used by prior work has major limitations, and we introduce four new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over five datasets. Columbo has been used in production on EDI, a major data lake for environmental sciences.
Technical Perspective: Unicorn: A Unified Multi-Tasking Matching Model
ACM SIGMOD Record · 2024-05-14
article · 1st author · Corresponding
Data integration has been a long-standing challenge for data management. It has recently received significant attention due to at least three main reasons. First, many data science projects require integrating data from disparate sources before analysis can be carried out to extract insights. Second, many organizations want to build knowledge graphs, such as Customer 360s, Product 360s, and Supplier 360s, which capture all available information about the customers, products, and suppliers of an organization. Building such knowledge graphs often requires integrating data from multiple sources. Finally, there is also an increasing need to integrate a massive amount of data to create training data for AI models, such as large language models.
Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Proceedings of the VLDB Endowment · 2023-02-01 · 29 citations
article · Senior author
Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
Effective entity matching with transformers
The VLDB Journal · 2023-01-17 · 28 citations
article
Cloud data systems
Proceedings of the VLDB Endowment · 2022-08-01
article
The panel will discuss the research opportunities for the database research community in the context of cloud native data services.
The Seattle report on database research
Communications of the ACM · 2022 · 43 citations
- Computer Science
- Database
Every five years, a group of the leading database researchers meet to reflect on their community's impact on the computing industry as well as examine current research challenges.
Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
2022 IEEE International Conference on Big Data (Big Data) · 2022-12-17 · 2 citations
article · Senior author
Many applications need to clean data with a target accuracy, e.g., with at least 95% precision. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings then asking a user to verify and clean the clusters, until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. So the users often take ad-hoc, suboptimal, or incorrect actions. Verifying and cleaning also often take a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
Deep Learning for Semantic Matching: A Survey
Journal of Computer Science and Cybernetics · 2021-10-12 · 3 citations
article · Open access · Senior author
Semantic matching finds certain types of semantic relationships among schema/data constructs. Examples include entity matching, entity linking, coreference resolution, schema/ontology matching, semantic text similarity, textual entailment, question answering, tagging, etc. Semantic matching has received much attention in the database, AI, KDD, Web, and Semantic Web communities. Recently, many works have also applied deep learning (DL) to semantic matching. In this paper we survey this fast-growing topic. We define the semantic matching problem, categorize its variations into a taxonomy, and describe important applications. We describe DL solutions for important variations of semantic matching. Finally, we discuss future R&D directions.
Deep learning for blocking in entity matching
Proceedings of the VLDB Endowment · 2021 · 77 citations
Senior author · Corresponding
- Computer Science
- Artificial Intelligence
Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking then matching. Many works have applied deep learning (DL) to matching, but far fewer works have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL and some of them require labeled training data. In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformer, self supervision). We empirically determine which solutions perform best on what kind of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution), on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new venue for research.
Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
arXiv (Cornell University) · 2021-01-13
preprint · Open access · Senior author
Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings then asking a user to verify and clean the clusters, until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.

Recent grants

III: Medium: Enabling Technologies for 21st Century Entity Matching Applications
NSF · $1.1M · 2016–2021
CAREER: Evolving and Self-Managing Data Integration Systems
NSF · $238k · 2006–2010

Frequent coauthors

Alon Halevy
49 shared
Zachary G. Ives
University of Pennsylvania
22 shared
Raghu Ramakrishnan
20 shared
Jeffrey F. Naughton
19 shared
Peter Haddawy
Mahidol University
17 shared
Warren Shen
13 shared
Sanjib Das
Jadavpur University
13 shared
Pedro Domingos
University of Washington
13 shared

Labs

UW-Madison Database Group · PI

Awards & honors

Gurindar S. Sohi Professorship, 2020
Vilas Distinguished Achievement Professorship, 2018
Alfred Sloan Research Fellowship, 2007
NSF CAREER Award, 2004
ACM Doctoral Dissertation Award, 2003
