
Tatsunori Hashimoto
Machine Learning & Natural Language Processing · Verified · Stanford University · Learning, Design, and Technology
Active 1980–2026
About
Tatsunori Hashimoto is a researcher focused on the analysis of large unweighted directed graphs, which are commonly used to capture relations between entities. His work addresses the fundamental problem of defining similarity or dissimilarity between vertices in such networks. Hashimoto has contributed to the development of a class of techniques for analyzing random walks on graphs using stochastic calculus. These techniques enable the generalization of results on the degeneracy of hitting times and the analysis of a metric based on the Laplace transformed hitting time (LTHT). This metric serves as a natural and provably well-behaved alternative to the expected hitting time. Hashimoto's research establishes a general correspondence between hitting times of Brownian motion and analogous hitting times on graphs. He demonstrates that the LTHT metric is consistent with the underlying metric of a geometric graph, preserves clustering tendencies, and remains robust against the random addition of non-geometric edges. His work includes tests on simulated and real-world data, showing that the LTHT matches theoretical predictions and outperforms alternative metrics.
Research topics
- Computer Science
- Artificial Intelligence
- Computational biology
- Genetics
- Engineering
- Natural Language Processing
- Political Science
- Biology
- Geology
- Management science
- Data science
- Physics
- Psychology
- Evolutionary biology
- Engineering ethics
- Law
- Mathematics
- Cognitive psychology
Selected publications
Towards Execution-Grounded Automated AI Research
arXiv (Cornell University) · 2026-01-20
preprint · Open access · Senior author
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from the execution feedback. To investigate these, we first build an automated executor to implement ideas and launch large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems - LLM pre-training and post-training - into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods to learn from the execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline (69.4% vs 48.0%) on post-training, and finds a pre-training recipe that outperforms the nanoGPT baseline (19.7 minutes vs 35.9 minutes) on pre-training, all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse. It successfully improves the average reward of the ideator model but not the upper-bound, due to models converging on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts towards execution-grounded automated AI research.
Data-efficient pre-training by scaling synthetic megadocs
ArXiv.org · 2026-03-19
article · Open access
Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near $1.48\times$ data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from $1.48\times$ to $1.80\times$ at $32$ generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
End-to-End Test-Time Training for Long Context
arXiv (Cornell University) · 2025-12-29
preprint · Open access
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
2025-01-01 · 33 citations
article · Open access · Senior author
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, Tatsunori Hashimoto. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
PubMed · 2025-12-01
article · 1st author · Corresponding
The patient was a 68-year-old man. The diagnosis was Bismuth type 1 hilar cholangiocarcinoma with right hepatic arterial infiltration. Right hepatectomy with extrahepatic bile duct resection was planned. Computed tomography (CT) volumetry using SYNAPSE VINCENT revealed the volume of the entire liver to be 1,417 mL, the future liver remnant volume (FLR-V) to be 390 mL (28%), and the estimated indocyanine green clearance rate of the future liver remnant (ICG-Krem) to be 0.043. The remnant was too small, so portal vein embolization (PVE) was performed, and 4 days after PVE, hepatic vein embolization (HVE) was done. Fourteen days after PVE, FLR-V increased by 472 mL (34%), and future liver remnant function (FLR-f) assessed by single-photon emission computed tomography (SPECT) increased by 563 mL (41%). Twenty-eight days after PVE, FLR-V increased by 573 mL (41.8%), FLR-f 648 mL (47%), ICG-Krem 0.064, and functional ICG-Krem assessed by SPECT 0.073. Thirty-two days after PVE, right hepatectomy with extrahepatic bile duct resection was performed. Surgery time was 764 min, and blood loss was 1,900 mL. The cancer was pathologically diagnosed as moderately differentiated adenocarcinoma, pT2a, pN0, pM0, pStageⅡ, R0 resected. The patient's postoperative course was uneventful, and he was discharged 23 days after surgery. He started to receive adjuvant S-1 (120 mg) chemotherapy 28 days after surgery, and there has been no evidence of recurrence for 4 months.
CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition
ArXiv.org · 2025-02-03
preprint · Open access
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss not only scales with input length but also varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the diverse ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and, while motivated by multilingual ASR, offers the potential for reducing group disparities in other domains with similar challenges.
A case of an alpha-fetoprotein-producing intrahepatic cholangiocarcinoma
Clinical Journal of Gastroenterology · 2025-04-10
review
Frequent coauthors
- Percy Liang · 46 shared
- Daniel Kang · 25 shared
- David K. Gifford (Massachusetts Institute of Technology) · 19 shared
- Richard I. Sherwood (Brigham and Women's Hospital) · 19 shared
- Tommi Jaakkola · 17 shared
- Faisal Ladhak · 16 shared
- Jiro Yoshida (Pennsylvania State University) · 15 shared
- Esin Durmus · 15 shared
Labs
Research in robust and trustworthy machine learning systems, particularly in complex systems like large language models.
Education
- B.A., Statistics and Math · Harvard
- M.S. · MIT
- Ph.D. · MIT
Awards & honors
- Best paper runner-up at ICML 2018