Raul Castro Fernandez

Assistant Professor of Computer Science · Verified

University of Chicago · Computer Science

Active 1970–2025

h-index: 28

Citations: 3.3k

Papers: 139 (57 in the last 5 years)

Funding: —



About

Raul Castro Fernandez is an Assistant Professor of Computer Science at the University of Chicago. His research spans data management, data science, databases, systems, and the value of data. He studies how to make the best use of data by building systems to share, discover, prepare, integrate, and process it, often drawing on techniques from data management, statistics, and machine learning. He is part of ChiData, the data systems research group at the University of Chicago, where he conducts research on large-scale video analysis, efficient data processing systems, and the economics of data. His work aims to advance the understanding of data-sharing markets and improve data utilization, contributing to the fields of systems, architecture, and networking.

Research topics

Computer Science
Data Mining
Artificial Intelligence
Information Retrieval
Computer Security
Machine Learning
Data science
Database
World Wide Web
Distributed computing
Business
Operating system
Economics
Finance
Computer network

Selected publications

Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking
Lecture Notes in Computer Science · 2025-11-23
book-chapter
What is the Value of Data? A Theory and Systematization
ACM / IMS Journal of Data Science · 2025-03-31
article · Open access · 1st author · Corresponding
Data powers economies, shapes societies, and fuels decision-making, yet its value remains poorly understood. Despite its centrality, we lack a unified framework for defining, measuring, and reasoning about data’s worth. This article develops a theory and systematization of the value of data—explaining why, how, and when data generates value. We distinguish data from documents, separate objective value from subjective judgments, and identify key dimensions of data’s worth. Our framework reconciles disparate notions of information, knowledge, and utility, offering insights that validate known principles while uncovering new opportunities to extract value from data. More than a taxonomy, this work provides a conceptual foundation for integrating perspectives from computer science, economics, and beyond. The conceptual foundation clarifies data’s role in technology, markets, and governance, advancing our ability to systematically understand and harness its value.
Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System
Proceedings of the ACM on Management of Data · 2025-06-17 · 6 citations
article · Open access · Senior author
Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its content but also on the context, which is usually tribal knowledge known to the individual or team. While tools like data catalogs and academic data discovery systems target this problem, they rely on keyword search or more complex interfaces, limiting non-technical users' ability to find relevant data. The advent of large language models (LLMs) offers a unique opportunity for users to ask questions directly in natural language, making dataset discovery more intuitive, accessible, and efficient. In this paper, we introduce Pneuma, a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. Pneuma leverages large language models (LLMs) for both table representation and table retrieval. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search, harnessing the strengths of both to improve retrieval performance. To evaluate Pneuma, we generate comprehensive benchmarks that simulate table discovery workload on six real-world datasets including enterprise data, scientific databases, warehousing data, and open data. Our results demonstrate that Pneuma outperforms widely used table search systems (such as full-text search and state-of-the-art RAG systems) in accuracy and resource efficiency.
Where Does Academic Database Research Go From Here?
ArXiv.org · 2025-04-11
preprint · Open access · Senior author
Panel proposal for an open forum to discuss and debate the future of database research in the context of industry, other research communities, and AI. Includes summaries of past panels, positions from panelists, as well as positions from a sample of the data management community.
Where Does Academic Database Research Go from Here?
Proceedings of the VLDB Endowment · 2025-08-01
article · Senior author
An open forum to discuss and debate the future of database research in the context of industry, other research communities, and AI.
Not-So-Bitter Pill to Swallow: Slipstreaming Memory Safe Programming via Rust as part of a Database Systems Course
2025-06-22
article · Open access · Senior author
Core Hours and Carbon Credits: Incentivizing Sustainability in HPC
2025-11-12 · 2 citations
article · Open access
Efforts to reduce the environmental impact of HPC often focus on resource providers, but choices made by users, e.g., concerning where to run, can be equally consequential. Here we present evidence that new accounting methods that charge users for energy used can incentivize significantly more efficient behavior. We first survey 300 HPC users and find that fewer than 30% are aware of their energy consumption, and that energy efficiency is a low priority concern. We then propose two new multi-resource accounting methods that charge for computations based on their energy consumption or carbon footprint, respectively. Finally, we conduct both simulation studies and a user study to evaluate the impact of these two methods on user behavior. We find that while only providing users feedback on their energy use had no impact on their behavior, associating energy with cost incentivized users to select more efficient resources, and use 40% less energy.
Core Hours and Carbon Credits: Incentivizing Sustainability in HPC
arXiv (Cornell University) · 2025-01-16
preprint · Open access
Realizing a shared responsibility between providers and consumers is critical to manage the sustainability of HPC. However, while cost may motivate efficiency improvements by infrastructure operators, broader progress is impeded by a lack of user incentives. We conduct a survey of HPC users that reveals fewer than 30 percent are aware of their energy consumption, and that energy efficiency is among users' lowest priority concerns. One explanation is that existing pricing models may encourage users to prioritize performance over energy efficiency. We propose two transparent multi-resource pricing schemes, Energy- and Carbon-Based Accounting, that seek to change this paradigm by incentivizing more efficient user behavior. These two schemes charge for computations based on their energy consumption or carbon footprint, respectively, rewarding users who leverage efficient hardware and software. We evaluate these two pricing schemes via simulation, in a prototype, and a user study.
Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior
ArXiv.org · 2025-06-03
preprint · Open access · Senior author
As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the correlation value observed, then such correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model's raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs and it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
Where Does Academic Database Research Go From Here?
2025-06-17
article · Senior author

Frequent coauthors

Samuel Madden
19 shared
Bhuvana Ramabhadran
18 shared
Sainyam Galhotra
Cornell University
12 shared
Peter Pietzuch
10 shared
Michael Stonebraker
Massachusetts Institute of Technology
10 shared
Pranav Subramaniam
10 shared
Ron Hoory
10 shared
Rosalind W. Picard
Massachusetts Institute of Technology
9 shared

Labs

Raul Castro Fernandez Lab · PI

Education

Ph.D., Computer Science
University of Chicago
2010
M.S., Computer Science
University of Chicago
2006
B.S., Computer Science
University of Chicago
2004

Awards & honors

2025 Sloan Fellowship
2024 NSF CAREER Award
2023 SIGMOD Test of Time Award
