
AnHai Doan
· Vilas Distinguished Achievement Professor; Gurindar S. Sohi Professor · University of Wisconsin-Madison · Computer Sciences
Active 1994–2025
About
AnHai Doan is the Vilas Distinguished Achievement Professor and Gurindar S. Sohi Professor of Computer Science at the University of Wisconsin–Madison. His research goal is to make messy data usable at scale. He works on data integration, data science, and machine learning, building end-to-end systems that are deployed in real-world settings. He has received the ACM Doctoral Dissertation Award, NSF CAREER Award, and Sloan Fellowship, and co-authored Principles of Data Integration, a widely used textbook. AnHai has worked extensively at the intersection of academia and industry, serving on the advisory board of Transformic (acquired by Google), as Chief Scientist at Kosmix (acquired by Walmart), and co-founding GreenBay Technologies (acquired by Informatica). He has also served on the SIGMOD Advisory and Executive Committees and was Co-Chair of SIGMOD 2020. His background includes growing up in Vietnam, studying in Hungary, and earning his Ph.D. from the University of Washington in 2002. His career spans roles as a graduate student, professor, startup employee, and big-company employee, with interests outside of work in architecture, history, art, interior design, traveling, and long-distance hiking.
Research topics
- Computer Science
- Data Mining
- Artificial Intelligence
- World Wide Web
- Data science
- Database
- Theoretical computer science
- Algorithm
- Mathematics
Selected publications
Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models
2025-01-01
Article · Open access · Senior author
Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for many downstream NLP tasks for tabular data, such as NL2SQL, table QA, and keyword search. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper, we make three contributions that significantly advance the state of the art. First, we show that the synthetic public data used by prior work has major limitations, and we introduce four new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over five datasets. Columbo has been used in production on EDI, a major data lake for environmental sciences.
Technical Perspective: Unicorn: A Unified Multi-Tasking Matching Model
ACM SIGMOD Record · 2024-05-14
Article · 1st author · Corresponding
Data integration has been a long-standing challenge for data management. It has recently received significant attention due to at least three main reasons. First, many data science projects require integrating data from disparate sources before analysis can be carried out to extract insights. Second, many organizations want to build knowledge graphs, such as Customer 360s, Product 360s, and Supplier 360s, which capture all available information about the customers, products, and suppliers of an organization. Building such knowledge graphs often requires integrating data from multiple sources. Finally, there is also an increasing need to integrate a massive amount of data to create training data for AI models, such as large language models.
Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Proceedings of the VLDB Endowment · 2023-02-01 · 29 citations
Article · Senior author
Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
Effective entity matching with transformers
The VLDB Journal · 2023-01-17 · 28 citations
Article
Proceedings of the VLDB Endowment · 2022-08-01
Article
The panel will discuss the research opportunities for the database research community in the context of cloud native data services.
The Seattle report on database research
Communications of the ACM · 2022 · 43 citations
- Computer Science
- Database
Every five years, a group of the leading database researchers meet to reflect on their community's impact on the computing industry as well as examine current research challenges.
Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
2022 IEEE International Conference on Big Data (Big Data) · 2022-12-17 · 2 citations
Article · Senior author
Many applications need to clean data with a target accuracy, e.g., with at least 95% precision. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings then asking a user to verify and clean the clusters, until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. So the users often take ad-hoc, suboptimal, or incorrect actions. Verifying and cleaning also often take a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
DEEP LEARNING FOR SEMANTIC MATCHING: A SURVEY
Journal of Computer Science and Cybernetics · 2021-10-12 · 3 citations
Article · Open access · Senior author
Semantic matching finds certain types of semantic relationships among schema/data constructs. Examples include entity matching, entity linking, coreference resolution, schema/ontology matching, semantic text similarity, textual entailment, question answering, tagging, etc. Semantic matching has received much attention in the database, AI, KDD, Web, and Semantic Web communities. Recently, many works have also applied deep learning (DL) to semantic matching. In this paper we survey this fast growing topic. We define the semantic matching problem, categorize its variations into a taxonomy, and describe important applications. We describe DL solutions for important variations of semantic matching. Finally, we discuss future R&D directions.
Deep learning for blocking in entity matching
Proceedings of the VLDB Endowment · 2021 · 77 citations
Senior author · Corresponding
- Computer Science
- Artificial Intelligence
Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking then matching. Many works have applied deep learning (DL) to matching, but far fewer works have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL and some of them require labeled training data. In this paper, we develop the DeepBlocker framework that significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformer, self supervision). We empirically determine which solutions perform best on what kind of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution), on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new venue for research.
Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
arXiv (Cornell University) · 2021-01-13
Preprint · Open access · Senior author
Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in industry by automatically clustering the strings then asking a user to verify and clean the clusters, until reaching 100% accuracy. This solution has significant limitations. It does not tell the users how to verify and clean the clusters. This part also often takes a lot of time, e.g., days. Further, there is no effective way for multiple users to collaboratively verify and clean. In this paper we address these challenges. Overall, our work advances the state of the art in data cleaning by introducing a novel cleaning problem and describing a promising solution template.
Recent grants
III: Medium: Enabling Technologies for 21st Century Entity Matching Applications
NSF · $1.1M · 2016–2021
CAREER: Evolving and Self-Managing Data Integration Systems
NSF · $238k · 2006–2010
Frequent coauthors
- 49 shared
Alon Halevy
- 22 shared
Zachary G. Ives
University of Pennsylvania
- 20 shared
Raghu Ramakrishnan
- 19 shared
Jeffrey F. Naughton
- 17 shared
Peter Haddawy
Mahidol University
- 13 shared
Warren Shen
- 13 shared
Sanjib Das
Jadavpur University
- 13 shared
Pedro Domingos
University of Washington
Awards & honors
- Gurindar S. Sohi Professorship, 2020
- Vilas Distinguished Achievement Professorship, 2018
- Alfred Sloan Research Fellowship, 2007
- NSF CAREER Award, 2004
- ACM Doctoral Dissertation Award, 2003