Juliana Freire

Professor of Computer Science and Engineering, NYU Tandon; Professor of Data Science

New York University · Computer Science

Active 1800–2026

h-index: 58

Citations: 12.7k

Papers: 362 · 69 in the last 5 years

Funding: $6.3M

About

Juliana Freire is an Institute Professor at the NYU Tandon School of Engineering and a Professor of Computer Science and Data Science at New York University. Her research develops methods and systems that enable users to obtain trustworthy insights from data, focusing on large-scale data analysis, visualization, machine learning, provenance management, and web information discovery. Her work spans application areas including urban analytics, predictive modeling, and computational reproducibility. Freire has contributed to the database and web research communities with over 250 technical papers, several open-source systems, and 12 U.S. patents. Her honors include being named an ACM Fellow and an AAAS Fellow, and receiving the ACM SIGMOD Contributions Award and an NSF CAREER award. Her research is funded by major agencies and foundations, and she held academic positions at the University of Utah before joining NYU. Her current work includes developing scalable data collection pipelines and machine learning techniques to monitor and disrupt illegal wildlife trade online, using large language models for data filtering and analysis.

Research topics

Computer Science
Data science
Information Retrieval
Data Mining
Database
Artificial Intelligence
Mathematics
Programming language
World Wide Web
Machine Learning
Statistics
Engineering
Software engineering
Combinatorics

Selected publications

From FAIR to CURE: guidelines for computational models of biological systems
npj Systems Biology and Applications · 2026-03-27 · 1 citation
article · Open access
Guidelines for managing scientific data have been established under the FAIR principles, requiring that data be Findable, Accessible, Interoperable, and Reusable. In many scientific disciplines, especially computational biology, both data and models are key to progress. For this reason, and recognizing that such models are a very special type of "data", we argue that computational models, especially mechanistic models prevalent in medicine, physiology and systems biology, deserve a complementary set of guidelines. We propose the CURE principles, emphasizing that models should be Credible, Understandable, Reproducible, and Extensible. We delve into each principle, discussing verification, validation, and uncertainty quantification for model credibility; the clarity of model descriptions and annotations for understandability; adherence to standards and open science practices for reproducibility; and the use of open standards and modular code for extensibility and reuse. We outline recommended and baseline requirements for each aspect of CURE, aiming to enhance the impact and trustworthiness of computational models, particularly in biomedical applications where credibility is paramount. Our perspective underscores the need for a more disciplined approach to modeling, aligning with emerging trends such as Digital Twins and emphasizing the importance of data and modeling standards for interoperability and reuse. Finally, we emphasize that given the non-trivial effort required to implement the guidelines, the community should strive to automate as many of the guidelines as possible.
BDIViz in Action: Interactive Curation and Benchmarking for Schema Matching Methods
ArXiv.org · 2026-04-12
article · Open access · Senior author
Schema matching remains fundamental to data integration, yet evaluating and comparing matching methods is hindered by limited benchmark diversity and lack of interactive validation frameworks. BDIViz, recently published at IEEE VIS 2025, is an interactive visualization system for schema matching with LLM-assisted validation. Given source and target datasets, BDIViz applies automatic matching methods and visualizes candidates in an interactive heatmap with hierarchical navigation, zoom, and filtering. Users validate matches directly in the heatmap and inspect ambiguous cases using coordinated views that show attribute descriptions, example values, and distributions. An LLM assistant generates structured explanations for selected candidates to support decision-making. This demonstration showcases a new extension to BDIViz that addresses a critical need in data integration research: human-in-the-loop benchmarking and iterative matcher development. New matchers can be integrated through a standardized interface, while user validations become evolving ground truth for real-time performance evaluation. This enables benchmarking new algorithms, constructing high-quality ground-truth datasets through expert validation, and comparing matcher behavior across diverse schemas and domains. We demonstrate two complementary scenarios: (i) data harmonization, where users map a large tabular dataset to a target schema with value-level inspection and LLM-generated explanations; and (ii) developer-in-the-loop benchmarking, where developers integrate custom matchers, observe performance metrics, and refine their algorithms.
BDI-Kit Demo: A Toolkit for Programmable and Conversational Data Harmonization
arXiv (Cornell University) · 2026-04-07
preprint · Open access · Senior author
Data harmonization remains a major bottleneck for integrative analysis due to heterogeneity in schemas, value representations, and domain-specific conventions. BDI-Kit provides an extensible toolkit for schema and value matching. It exposes two complementary interfaces tailored to different user needs: a Python API enabling developers to construct harmonization pipelines programmatically, and an AI-assisted chat interface allowing domain experts to harmonize data through natural language dialogue. This demonstration showcases how users interact with BDI-Kit to iteratively explore, validate, and refine schema and value matches through a combination of automated matching, AI-assisted reasoning, and user-driven refinement. We present two scenarios: (i) using the Python API to programmatically compose primitives, examine intermediate outputs, and reuse transformations; and (ii) conversing with the AI assistant in natural language to access BDI-Kit's capabilities and iteratively refine outputs based on the assistant's suggestions.
BDI-Kit: An AI-powered toolkit for biomedical data harmonization
Patterns · 2026-02-12
article · Open access · Senior author
The wide availability of biomedical data, coupled with advanced analytics, holds unprecedented promise for scientific discovery and improved patient care; yet, heterogeneity across datasets remains a major barrier. Given the inherent diversity of biomedical domains, one-size-fits-all solutions are impractical. Despite decades of active research and numerous methods for automating data integration, there is a scarcity of open-source tools capable of handling this complexity. To address these challenges, we introduce Biomedical Data Integration and Harmonization Toolkit (BDI-Kit), an extensible toolkit designed for human-AI collaboration that provides a diverse suite of harmonization methods. It offers two complementary interfaces: a Python API that supports the creation of computational pipelines for harmonization and an AI-assisted chat interface that enables domain experts to perform harmonization using natural language. In this paper, we describe BDI-Kit and demonstrate its capabilities through real-world use cases. By simplifying data harmonization, BDI-Kit empowers researchers and practitioners, facilitating effective exploration and accelerating scientific discovery and clinical research.
StraTyper: Automated Semantic Type Discovery and Multi-Type Annotation for Dataset Collections
ArXiv.org · 2026-02-03
article · Open access · Senior author
Understanding dataset semantics is crucial for effective search, discovery, and integration pipelines. To this end, column type annotation (CTA) methods associate columns of tabular datasets with semantic types that accurately describe their contents, using pre-trained deep learning models or Large Language Models (LLMs). However, existing approaches require users to specify a closed set of semantic types either at training or inference time, hindering their application to domain-specific datasets where pre-defined labels often lack adequate coverage and specificity. Furthermore, real-world datasets frequently contain columns with values belonging to multiple semantic types, violating the single-type assumption of existing CTA methods. While proprietary LLMs have shown effectiveness for CTA, they incur high monetary costs and produce inconsistent outputs for similar columns, leading to type redundancy that negatively affects downstream applications. To address these challenges, we introduce StraTyper, a cost-effective method for column type discovery (CTD) and multi-type annotation (CMTA) in dataset collections. StraTyper eliminates the need for pre-defined semantic labels by systematically employing LLMs to discover types tailored to the dataset collection at hand. Through strategic column clustering, controlled type generation, and iterative cascading discovery, StraTyper balances type precision with annotation coverage while minimizing LLM costs. Our experimental evaluation (both manual and LLM-assisted) on real-world benchmarks demonstrates that StraTyper discovers accurate types for both numerical and non-numerical data, achieves substantial cost savings compared to commercial LLMs, and effectively handles multi-typed columns. We further show that StraTyper's annotations improve downstream tasks, including join discovery and schema matching, outperforming LLM-only baselines.
AutoDDG: Automated Dataset Description Generation using Large Language Models
Proceedings of the ACM on Management of Data · 2026-04-02 · 1 citation
preprint · Open access · Senior author
The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. Widely-used dataset search systems rely on keyword search over dataset metadata to support discovery. Therefore, when metadata is incomplete, missing, or inconsistent with dataset contents, findability is severely compromised. To address this limitation, we introduce AutoDDG, a framework that automatically generates textual descriptions of tabular data. By adopting a data-driven approach to summarize dataset contents and leveraging large language models (LLMs) to enrich summaries with semantic information and produce human-readable text, AutoDDG derives descriptions that are comprehensive, accurate, readable, and concise. A critical challenge in this problem is evaluating the effectiveness of description generation methods and assessing the quality of the generated descriptions. We propose a comprehensive evaluation methodology that combines retrieval, reference-based, and reference-free assessment, with human validation. Our experimental results using new benchmarks demonstrate that AutoDDG generates high-quality, accurate descriptions at scale, significantly improving dataset retrieval performance across diverse use cases. AutoDDG is available at https://github.com/VIDA-NYU/AutoDDG.
Implicações do uso de inteligência artificial à proteção de dados pessoais no Brasil [Implications of the use of artificial intelligence for personal data protection in Brazil]
Navus - Revista de Gestão e Tecnologia · 2026-02-25
article · Open access · 1st author · Corresponding
The growing use of artificial intelligence (AI) poses significant challenges for the processing of personal data in Brazil, especially in the context of the General Data Protection Law (Lei Geral de Proteção de Dados, LGPD). This study aims to identify the main implications of the use of AI for privacy and personal data protection. The research adopts an exploratory and descriptive approach, beginning with a literature review to build the theoretical framework on artificial intelligence and automated decisions. Next, a survey is conducted with data protection professionals, consisting of ten statements that respondents rated on a five-point Likert scale. After the data analysis, the main topics identified are explored in depth through interviews with specialists in these fields. The results indicate that the use of AI in Brazil raises concerns regarding personal data protection, as evidenced by the challenges of transparency and algorithmic bias identified in this study. The LGPD, although it guarantees the rights to explanation and review of automated decisions, faces limitations due to the lack of clear definitions for these processes, which may compromise the effective enforcement of the legislation. This research contributes to a deeper understanding of the implications of AI use for privacy and for the responsible governance of this technology.

Recent grants

CT-T: A Laboratory Workbench for Security Research
NSF · $1.5M · 2005–2010
CI-EN: Enhancing and Supporting a Community-Based Data Analysis, Visualization, and Provenance Platform
NSF · $500k · 2014–2017
III-COR: Discovering and Organizing Hidden-Web Sources
NSF · $378k · 2007–2012
III: EAGER: Collaborative Research: A Community Experiment Platform for Reproducibility and Generalizability
NSF · $190k · 2011–2013
Managing Complex Visualizations
NSF · $530k · 2005–2009

Frequent coauthors

Claudio Silva · 71 shared
David Koop · University of Massachusetts Dartmouth · 47 shared
Huy T. Vo · 40 shared
Emanuele Santos · 35 shared
Fernando Chirigati · 34 shared
Aécio Santos · 31 shared
Carlos Scheidegger · 26 shared
Steven P. Callahan · Epsilon Systems (United States) · 24 shared

Education

Ph.D., Computer Science
Stony Brook University
1997

Awards & honors

ACM SIGMOD Contributions Award (2020)
ACM Fellow (2014)
AAAS Fellow (2021)
Google Faculty Research Award (2013)
IBM Faculty Award (2008)
