Ashok Cutkosky

Assistant Professor – Electrical & Computer Engineering · Affiliated Faculty – Computer Science

Boston University · Computer Science

Active 2009–2025

h-index: 12

Citations: 2.1k

Papers: 95 (59 in the last 5 years)

Funding: not listed

About

Ashok Cutkosky is an assistant professor in the Electrical and Computer Engineering (ECE) department at Boston University. Before taking up this position, he worked as a research scientist at Google. He earned his PhD in computer science from Stanford University in 2018 under the supervision of Kwabena Boahen. He also holds an AB in mathematics from Harvard University (2013) and a master's degree in medicine. His research focuses on optimization algorithms for machine learning, with recent work on non-convex optimization and adaptive online learning. He has published widely in these areas and has taught courses on machine learning and optimization at Boston University.

Research topics

Computer Science
Statistics
Algorithm
Artificial Intelligence
Mathematics
Machine Learning
Mathematical analysis
Theoretical computer science
Physics
Mathematical optimization

Selected publications

Pretraining Improves Prediction of Genomic Datasets Across Species
bioRxiv (Cold Spring Harbor Laboratory) · 2025-08-24
preprint · Open access · Senior author · Corresponding
Recent studies suggest that deep neural network models trained on thousands of human genomic datasets can accurately predict genomic features, including gene expression and chromatin accessibility. However, training these models is computation- and time-intensive, and datasets of comparable size do not exist for most other organisms. Here, we identify modifications to an existing state-of-the-art model that improve model accuracy while reducing training time and computational cost. Using this streamlined model architecture, we investigate the ability of models pretrained on human genomic datasets to transfer performance to a variety of different tasks. Models pretrained on human data but fine-tuned on genomic datasets from diverse tissues and species achieved significantly higher prediction accuracy while significantly reducing training time compared to models trained from scratch, with Pearson correlation coefficients between experimental results and predictions as high as 0.8. Further, we found that including excessive training tasks decreased model performance and that this compromised performance could be partially but not completely rescued by fine-tuning. Thus, simplifying model architecture, applying pretrained models, and carefully considering the number of training tasks may be effective and economical techniques for building new models across data types, tissues, and species.
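For readers unfamiliar with the accuracy metric quoted in this abstract, the following is a minimal sketch of how a Pearson correlation coefficient between measured and predicted values can be computed. The function name `pearson_r` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(measured, predicted):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_m * var_p)


if __name__ == "__main__":
    # Hypothetical expression values, for illustration only.
    measured = [0.2, 1.5, 3.1, 4.0, 5.2]
    predicted = [0.5, 1.1, 2.7, 4.4, 4.9]
    print(pearson_r(measured, predicted))
```

A value near 1 means predictions track the measurements closely in a linear sense; it does not by itself say anything about calibration or absolute error.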
Unconstrained Robust Online Convex Optimization
ArXiv.org · 2025-06-15
preprint · Open access · Senior author
This paper addresses online learning with "corrupted" feedback. Our learner is provided with potentially corrupted gradients $\tilde g_t$ instead of the "true" gradients $g_t$. We make no assumptions about how the corruptions arise: they could be the result of outliers, mislabeled data, or even malicious interference. We focus on the difficult "unconstrained" setting in which our algorithm must maintain low regret with respect to any comparison point $u \in \mathbb{R}^d$. The unconstrained setting is significantly more challenging because existing algorithms suffer extremely high regret even with tiny amounts of corruption (which is not true in the case of a bounded domain). Our algorithms guarantee regret $\|u\| G (\sqrt{T} + k)$ when $G \ge \max_t \|g_t\|$ is known, where $k$ is a measure of the total amount of corruption. When $G$ is unknown we incur an extra additive penalty of $(\|u\|^2+G^2) k$.
Adaptive bandit algorithms increase efficiency of mobile tuberculosis screening programs
Scientific Reports · 2025-12-08
article · Open access
Community-based tuberculosis screening using mobile X-ray units can effectively increase case detection rates by reducing barriers to accessing services. This study evaluated the multi-armed bandit (MAB) framework, a machine learning approach, for optimizing mobile screening locations. Using simulations, we compared two MAB algorithms (Exp3 and LinUCB) with strategies based on historical case rates and random placement. The MAB algorithms continually updated site selection based on observed screening yields, and LinUCB additionally incorporated local socioeconomic indicators associated with tuberculosis rates. Over three years, assuming two mobile units serving 95 sites in Lima, Peru, 1,000 simulations demonstrated that the MAB algorithms significantly reduced the average number of screenings needed to detect one individual with tuberculosis: 112 (standard deviation [SD]: 10) for Exp3 and 79 (SD: 12) for LinUCB, versus 152 (SD: 11) for random placement and 143 (SD: 11) for historical case-rate-driven placement. LinUCB performed best, achieving a 20% increase in detection efficiency by week 16 and 50% by week 40 compared to case-rate-driven placement. Overall, both MAB algorithms improved tuberculosis screening yields, emphasizing the value of data-driven approaches for optimizing mobile screening interventions. Incorporating adaptive models into screening programs may enhance targeting efficiency and offers a promising direction for policymakers and implementers seeking to optimize resource allocation in high-burden settings.
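To make the idea of "continually updating site selection based on observed screening yields" concrete, here is a minimal sketch of the textbook Exp3 update. It is not the study's implementation: the class name, the exploration rate `gamma`, the weight renormalization, and the reward encoding (a number in [0, 1], for example 1 if a screening day detected a case) are all assumptions made for illustration.

```python
import math
import random


class Exp3:
    """Textbook Exp3 for rewards in [0, 1]; each arm is a candidate site."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate (assumed value)
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Importance-weighted reward estimate for the arm that was played.
        prob = self.probabilities()[arm]
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Renormalize so weights stay bounded; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


if __name__ == "__main__":
    # Hypothetical per-site detection probabilities, for illustration only.
    site_yield = [0.02, 0.05, 0.10]
    sim = random.Random(1)
    bandit = Exp3(n_arms=len(site_yield), gamma=0.1)
    for _ in range(1000):
        site = bandit.select()
        reward = 1.0 if sim.random() < site_yield[site] else 0.0
        bandit.update(site, reward)
    print(bandit.probabilities())
```

Exp3 uses only the observed reward of the chosen site. LinUCB, by contrast, also uses per-site features (the abstract mentions socioeconomic indicators), which this sketch does not model.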
Fully Unconstrained Online Learning
2024-01-01 · 1 citation
article · 1st author · Corresponding
Fully Unconstrained Online Learning
arXiv (Cornell University) · 2024-05-30
preprint · Open access · 1st author · Corresponding
We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, it matches the optimal bound in all cases in which one can achieve sublinear regret, which arguably includes most "interesting" scenarios.
General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization
arXiv (Cornell University) · 2024-11-11
preprint · Open access · Senior author
This work investigates the effectiveness of schedule-free methods, developed by A. Defazio et al. (NeurIPS 2024), in nonconvex optimization settings, inspired by their remarkable empirical success in training neural networks. Specifically, we show that schedule-free SGD achieves optimal iteration complexity for nonsmooth, nonconvex optimization problems. Our proof begins with the development of a general framework for online-to-nonconvex conversion, which converts a given online learning algorithm into an optimization algorithm for nonconvex losses. Our general framework not only recovers existing conversions but also leads to two novel conversion schemes. Notably, one of these new conversions corresponds directly to schedule-free SGD, allowing us to establish its optimality. Additionally, our analysis provides valuable insights into the parameter choices for schedule-free SGD, addressing a theoretical gap that the convex theory cannot explain.
Random Scaling and Momentum for Non-smooth Non-convex Optimization
arXiv (Cornell University) · 2024-05-16 · 1 citation
preprint · Open access · Senior author
Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms.
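The modification described in this abstract can be sketched as follows. This is one plain reading of "scale the update at each time point by an exponentially distributed random scalar", not the paper's exact algorithm: the momentum form (an exponential moving average of gradients), the Exp(1) rate, the hyperparameter values, and the function name are all assumptions made for illustration.

```python
import random


def sgdm_random_scaling(grad_fn, x0, lr=0.01, beta=0.9, steps=2000, seed=0):
    """SGD with momentum where each update is multiplied by s ~ Exp(1).

    grad_fn maps a list of floats to a (sub)gradient list of equal length.
    """
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        # Momentum as an exponential moving average of gradients (assumed form).
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
        # Exponentially distributed scalar with mean 1, drawn fresh each step.
        s = rng.expovariate(1.0)
        x = [xi - lr * s * mi for xi, mi in zip(x, m)]
    return x


if __name__ == "__main__":
    # Toy non-smooth objective f(x) = |x[0]| + 0.5 * x[1]**2 (illustrative).
    def subgradient(x):
        sign = 1.0 if x[0] > 0 else (-1.0 if x[0] < 0 else 0.0)
        return [sign, x[1]]

    print(sgdm_random_scaling(subgradient, [3.0, -2.0]))
```

Because the scalar has mean 1, the expected step equals the ordinary SGDM step; only the per-step length is randomized.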
The Road Less Scheduled
arXiv (Cornell University) · 2024-05-24 · 2 citations
preprint · Open access · Senior author
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.
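The abstract does not spell out the update rule, so the following is only a sketch of the schedule-free SGD recursion as it is commonly stated: a base SGD sequence `z`, a uniform running average `x`, and gradients evaluated at an interpolation `y` of the two. The function name, the default values, and the uniform averaging weight `1 / (t + 1)` are assumptions here; the linked repository is the authoritative implementation.

```python
def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=2000):
    """Sketch of schedule-free SGD with a constant learning rate.

    grad_fn maps a list of floats to a gradient list of equal length.
    Returns the averaged iterate x, which is the point to evaluate.
    """
    z = list(x0)  # base SGD sequence
    x = list(x0)  # running average of the z sequence
    for t in range(1, steps + 1):
        # Gradient is taken at an interpolation between z and x.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform averaging weight; no stopping time T is needed.
        c = 1.0 / (t + 1)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


if __name__ == "__main__":
    # Toy quadratic f(w) = 0.5 * ||w - target||^2 (illustrative only).
    target = [1.0, 2.0]
    grad = lambda w: [wi - ti for wi, ti in zip(w, target)]
    print(schedule_free_sgd(grad, [5.0, -3.0]))
```

With `beta = 0` the gradient is taken at `z` and the method reduces to plain SGD with iterate averaging, which is one way to see how averaging can stand in for a decaying learning-rate schedule.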
Online Linear Regression in Dynamic Environments via Discounting
arXiv (Cornell University) · 2024-05-29 · 1 citation
preprint · Open access · Senior author
We develop algorithms for online linear regression which achieve optimal static and dynamic regret guarantees even in the complete absence of prior knowledge. We present a novel analysis showing that a discounted variant of the Vovk-Azoury-Warmuth forecaster achieves dynamic regret of the form $R_{T}(\vec{u})\le O\left(d\log(T)\vee \sqrt{dP_{T}^{\gamma}(\vec{u})T}\right)$, where $P_{T}^{\gamma}(\vec{u})$ is a measure of variability of the comparator sequence, and show that the discount factor achieving this result can be learned on-the-fly. We show that this result is optimal by providing a matching lower bound. We also extend our results to strongly-adaptive guarantees which hold over every sub-interval $[a,b]\subseteq[1,T]$ simultaneously.
Adam with model exponential moving average is effective for nonconvex optimization
2024-01-01 · 6 citations
article · Senior author

Frequent coauthors

Francesco Orabona
12 shared
Manish Purohit
Google (United States)
11 shared
Aditya Bhaskara
9 shared
Harsh Mehta
University of Kansas Medical Center
9 shared
Kwabena Boahen
Stanford University
7 shared
E Aiden
Broad Institute
7 shared
Ravi Kumar
7 shared
Harsh Mehta
6 shared

Labs

Data Mining & Data Management · PI

Education

Ph.D.
Stanford University
2018
B.A., Mathematics
Harvard
2013
