
Gustavo de Veciana
Professor · University of Texas at Austin · Electrical and Computer Engineering
Active 1990–2026
About
Professor Gustavo de Veciana leads a research group focused on applied research and development, with interests spanning summer, part-time, permanent, and consulting work. The group has strong ties to industry: its members work or have worked at technology companies including Apple, Cisco, Dropbox, Intel, Google, Samsung, and Qualcomm. Professor de Veciana also mentors students and visiting researchers from a range of international institutions. Research topics under his guidance include resource-aware scheduling, wireless network performance, quality of service in wireless and cloud systems, network slicing, and collaborative sensing, with an overall focus on networking, wireless communications, and resource allocation in complex systems.
Research topics
- Computer Science
- Computer network
- Telecommunications
- Operations research
- Distributed computing
Selected publications
Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback
Open MIND · 2026-03-04
preprint
Hierarchical inference systems route tasks across multiple computational layers, where each node may either finalize a prediction locally or offload the task to a node in the next layer for further processing. Learning optimal routing policies in such systems is challenging: inference loss is defined recursively across layers, while feedback on prediction error is revealed only at a terminal oracle layer. This induces a partial, policy-dependent feedback structure in which observability probabilities decay with depth, causing importance-weighted estimators to suffer from amplified variance. We study online routing for multi-layer hierarchical inference under long-term resource constraints and terminal-only feedback. We formalize the recursive loss structure and show that naive importance-weighted contextual bandit methods become unstable as feedback probability decays along the hierarchy. To address this, we develop a variance-reduced EXP4-based algorithm integrated with Lyapunov optimization, yielding unbiased loss estimation and stable learning under sparse and policy-dependent feedback. We provide regret guarantees relative to the best fixed routing policy in hindsight and establish near-optimality under stochastic arrivals and resource constraints. Experiments on large-scale multi-task workloads demonstrate improved stability and performance compared to standard importance-weighted approaches.
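The variance amplification mentioned in this abstract can be illustrated with a small calculation. The sketch below is not the paper's algorithm. It assumes, purely for illustration, that every layer offloads independently with the same probability, so a task reaches the terminal layer (and its loss is observed) with probability `offload_prob ** depth`. The function names are hypothetical.

```python
def observation_probability(offload_prob: float, depth: int) -> float:
    """Chance that a task is offloaded through `depth` layers and reaches the oracle.

    Assumption for illustration: each layer offloads independently with the
    same probability.
    """
    return offload_prob ** depth


def iw_estimate(loss: float, observed: bool, p_obs: float) -> float:
    """Standard importance-weighted estimate: loss / p_obs if observed, else 0."""
    return loss / p_obs if observed else 0.0


def iw_mean_and_variance(loss: float, p_obs: float) -> tuple[float, float]:
    """Exact mean and variance of the estimator for a fixed (deterministic) loss."""
    observed_value = iw_estimate(loss, True, p_obs)
    mean = p_obs * observed_value                # unobserved rounds contribute 0
    second_moment = p_obs * observed_value ** 2
    return mean, second_moment - mean ** 2


# The mean stays at the true loss, while the variance grows as feedback thins out.
for depth in (1, 2, 3, 4):
    p = observation_probability(0.5, depth)
    mean, variance = iw_mean_and_variance(1.0, p)
    print(f"depth={depth} p_obs={p:.4f} mean={mean:.2f} variance={variance:.2f}")
```

For a fixed loss the variance equals `loss**2 * (1 / p_obs - 1)`, so the estimator stays unbiased at every depth while its variance grows without bound as `p_obs` shrinks.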
TIDE: Task-driven DNN Training and Splitting for Efficient Inference at the Mobile Edge
Zenodo (CERN European Organization for Nuclear Research) · 2026-02-09
article · Open access
The growing demands of DNN-based inference at the mobile edge are driving the need for increasingly efficient execution. Such applications often require fast and high-quality outputs, which are hard to realize due to the limited computational and communication capabilities at the edge. This paper tackles these issues, focusing on a DNN for the execution of tasks that are homogeneous in nature but heterogeneous in their domains. The key idea is to start with a parent DNN of interconnected computational elements (atoms), and strategically form a collection of task-specific DNNs suitable for distributed deployment. Such task-specific DNNs may include common as well as uniquely used atoms of the parent DNN. Ultimately, the aim is that they be smaller in size – thus a better match for edge resources – and achieve low-cost inference. We solve the problem of determining the best collection of task-specific DNNs through an algorithmic framework named TIDE. Experimental results show that TIDE decreases inference cost and time by 90% and 80% (resp.) relative to centralized approaches, and by over 60% and 70% (resp.) when compared to the best benchmark.
Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback
ArXiv.org · 2026-03-04
article · Open access
TIDE: Task-driven DNN Training and Splitting for Efficient Inference at the Mobile Edge
Open MIND · 2026-02-09
article
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
Open MIND · 2026-02-05
preprint
Post-training quantization (PTQ) enables efficient deployment of large language models by mapping pretrained weights to low-bit formats without retraining, typically using a small calibration set to minimize a layer-wise calibration objective. However, this sequential procedure induces a mismatch: errors from earlier quantized layers alter the inputs received by later layers, causing the activations to deviate from those of the full-precision model. Recent approaches introduce mismatch-aware calibration objectives to compensate for this effect, but leave open how much of the observed mismatch should shift each layer's calibration target. Fully applying this correction can overfit limited calibration data, while scaling the mismatch correction with a fixed coefficient ignores varying reliability of mismatch estimates across layers. To address these limitations, we propose CoreQ, a learning-free PTQ framework that applies a closed-form coefficient for mismatch correction derived from a geometric decomposition of the mismatch. The resulting coefficient adapts the correction across layers, reduces overfitting to finite calibration data, and requires no hyperparameter tuning. Given the corrected target, CoreQ minimizes the induced triangular least-squares objective with an efficient greedy successive-rounding solver and a bounded beam-search extension, K-CoreQ, that trades modest additional compute for improved performance. Across multiple LLM families, scales, bit-widths, and quantization settings, CoreQ improves perplexity and downstream accuracy over strong PTQ baselines.
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
arXiv (Cornell University) · 2026-02-05
article · Open access
Large language models (LLMs) deliver robust performance across diverse applications, yet their deployment often faces challenges due to the memory and latency costs of storing and accessing billions of parameters. Post-training quantization (PTQ) enables efficient inference by mapping pretrained weights to low-bit formats without retraining, but its effectiveness depends critically on both the quantization objective and the rounding procedure used to obtain low-bit weight representations. In this work, we show that interpolating between symmetric and asymmetric calibration acts as a form of regularization that preserves the standard quadratic structure used in PTQ while providing robustness to activation mismatch. Building on this perspective, we derive a simple successive rounding procedure that naturally incorporates asymmetric calibration, as well as a bounded-search extension that allows for an explicit trade-off between quantization quality and the compute cost. Experiments across multiple LLM families, quantization bit-widths, and benchmarks demonstrate that the proposed bounded search based on a regularized asymmetric calibration objective consistently improves perplexity and accuracy over PTQ baselines, while incurring only modest and controllable additional computational cost.
Generative Diffusion Model-Based Compression of MIMO CSI
2025-06-08 · 5 citations
article
While neural lossy compression techniques have markedly advanced the efficiency of Channel State Information (CSI) compression and reconstruction for feedback in MIMO communications, efficient algorithms for more challenging and practical tasks, such as CSI compression for future channel prediction and reconstruction with relevant side information, remain underexplored, often resulting in suboptimal performance when existing methods are extended to these scenarios. To that end, we propose a novel framework for compression with side information, featuring an encoding process with fixed-rate compression using a trainable codebook for codeword quantization, and a decoding procedure modeled as a backward diffusion process conditioned on both the codeword and the side information. Experimental results show that our method significantly outperforms existing CSI compression algorithms, often yielding over twofold performance improvement by achieving comparable distortion at less than half the data rate of competing methods in certain scenarios. These findings underscore the potential of diffusion-based compression for practical deployment in communication systems.
Importance Sampling via Score-based Generative Models
arXiv (Cornell University) · 2025-02-07
preprint · Open access · Senior author
Importance sampling, which involves sampling from a probability density function (PDF) proportional to the product of an importance weight function and a base PDF, is a powerful technique with applications in variance reduction, biased or customized sampling, data augmentation, and beyond. Inspired by the growing availability of score-based generative models (SGMs), we propose an entirely training-free importance sampling framework that relies solely on an SGM for the base PDF. Our key innovation is realizing the importance sampling process as a backward diffusion process, expressed in terms of the score function of the base PDF and the specified importance weight function (both readily available), eliminating the need for any additional training. We conduct a thorough analysis demonstrating the method's scalability and effectiveness across diverse datasets and tasks, including importance sampling for industrial and natural images with neural importance weight functions. The training-free aspect of our method is particularly compelling in real-world scenarios where a single base distribution underlies multiple biased sampling tasks, each requiring a different importance weight function. To the best of our knowledge, our approach is the first importance sampling framework to achieve this.
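The target density described in this abstract can be written out explicitly. The notation below is an assumption introduced here for illustration, not taken from the paper: base PDF p, importance weight function w, target q, and normalizing constant Z. The identity covers only the clean (un-noised) densities. How the paper handles the noised densities along the backward diffusion is not reproduced here.

```latex
\[
  q(x) \;=\; \frac{w(x)\,p(x)}{Z},
  \qquad
  Z \;=\; \int w(x)\,p(x)\,\mathrm{d}x ,
\]
\[
  \nabla_x \log q(x) \;=\; \nabla_x \log p(x) \;+\; \nabla_x \log w(x)
  \quad \text{wherever } w(x) > 0 .
\]
```

Because Z does not depend on x, it drops out of the score. That is why a score model for the base PDF and a differentiable weight function are enough to describe the target's score without knowing the normalizing constant.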
Fundamentals of Caching Layered Data Objects
2025-07-21 · 1 citation
article
The effective management of the vast amounts of data processed or required by modern cloud and edge computing systems remains a fundamental challenge. This paper focuses on cache management for applications where data objects can be stored in layered representations. In such representations, each additional data layer enhances the “quality” of the object’s version, albeit at the cost of increased memory usage. This layered approach is advantageous in various scenarios, including the delivery of zoomable maps, video coding, future virtual reality gaming, and layered neural network models, where additional data layers improve quality/inference accuracy. In systems where users or devices request different versions of a data object, layered representations provide the flexibility needed for caching policies to achieve improved hit rates, i.e., delivering the specific representations required by users. This paper investigates the performance of the Least Recently Used (LRU) caching policy in the context of layered representation for data, referred to as Layered LRU (LLRU). To this end, we develop an asymptotically accurate analytical model for LLRU. We analyze how LLRU’s performance is influenced by factors such as the number of layers, as well as the popularity and size of an object’s layers. For example, our results demonstrate that, in the case of LLRU, adding more layers does not always enhance performance. Instead, the effectiveness of LLRU depends intricately on the popularity distribution and size characteristics of the layers.
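One plausible reading of LRU over layered objects can be sketched as follows. The abstract does not give the exact LLRU rules, so this is an illustrative assumption rather than the paper's policy: a request for version `v` of an object needs layers 1 through `v`, each layer is cached and evicted as its own item, and eviction removes the least recently used layer. The class and method names are hypothetical.

```python
from collections import OrderedDict


class LayeredLRUCache:
    """Illustrative LRU cache whose items are (object, layer) pairs."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity    # total size budget, in the same units as layer sizes
        self.used = 0
        self.items = OrderedDict()  # (obj, layer) -> size, least recently used first

    def request(self, obj: str, version: int, layer_sizes: list[int]) -> bool:
        """Serve layers 1..version of `obj`; return True only on a full hit."""
        hit = all((obj, k) in self.items for k in range(1, version + 1))
        # Touch the highest layer first so the base layer ends up most recent.
        # Higher layers are then evicted before the lower layers they depend on.
        for k in range(version, 0, -1):
            key = (obj, k)
            if key in self.items:
                self.items.move_to_end(key)
            else:
                self.items[key] = layer_sizes[k - 1]
                self.used += layer_sizes[k - 1]
        # Evict least recently used layers until the cache fits again.
        while self.used > self.capacity:
            _, size = self.items.popitem(last=False)
            self.used -= size
        return hit


cache = LayeredLRUCache(capacity=3)
print(cache.request("map", 2, [1, 1]))   # cold cache, so a miss
print(cache.request("map", 1, [1, 1]))   # base layer is cached, so a hit
```

Touching layers from the top down keeps each object's cached layers contiguous from the base layer upward, so a low-quality version can still be served after its enhancement layers have been evicted.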
Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
ArXiv.org · 2025-08-01
preprint · Open access · Senior author
With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.
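The two prioritization ideas named in this abstract, serving decode requests near their TBT deadline first and ordering prefill requests by prompt length, can be sketched as a toy batch builder. This is not the SLAI scheduler. The data classes, the `urgent_slack` threshold, the per-iteration `token_budget`, the one-token cost per decode step, and the shortest-prompt-first ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decode:
    req_id: str
    deadline: float   # time by which the next token is due (hypothetical field)


@dataclass
class Prefill:
    req_id: str
    prompt_len: int   # known prompt length in tokens


def build_batch(decodes, prefills, now, urgent_slack, token_budget):
    """Pick request ids for one iteration under a token budget.

    Order of admission: urgent decodes (earliest deadline first), then
    prefills (shortest prompt first, an assumption), then remaining decodes.
    Each decode costs one token; each prefill costs its prompt length.
    """
    urgent = sorted((d for d in decodes if d.deadline - now <= urgent_slack),
                    key=lambda d: d.deadline)
    relaxed = sorted((d for d in decodes if d.deadline - now > urgent_slack),
                     key=lambda d: d.deadline)

    batch = [d.req_id for d in urgent[:token_budget]]
    budget = token_budget - len(batch)

    for p in sorted(prefills, key=lambda p: p.prompt_len):
        if p.prompt_len <= budget:
            batch.append(p.req_id)
            budget -= p.prompt_len

    batch.extend(d.req_id for d in relaxed[:budget])
    return batch


decodes = [Decode("d1", deadline=1.05), Decode("d2", deadline=2.0)]
prefills = [Prefill("p1", prompt_len=6), Prefill("p2", prompt_len=3)]
print(build_batch(decodes, prefills, now=1.0, urgent_slack=0.1, token_budget=8))
```

In this example the decode request closest to its deadline is admitted first, the shorter prompt fits in the remaining budget while the longer one waits, and the non-urgent decode fills what is left.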
Recent grants
CNS Core: Small: Online Safe Reinforcement Learning for Wireless Resource Allocation
NSF · $500k · 2019–2023
CSR-EHS: Novel Mobile and Distributed Embedded Systems for Pervasive Computing Applications
NSF · $340k · 2005–2011
NeTS: Small: Collaborative Research: Supporting unstructured peer-to-peer social networking
NSF · $150k · 2009–2013
Collaborative Research: Extreme Densification of Wireless Networks
NSF · $734k · 2014–2018
Visibility and Interactive Information Sharing in Collaborative Sensing Systems
NSF · $450k · 2018–2022
Frequent coauthors
- Albert Banchs (Universidad Carlos III de Madrid) · 23 shared
- François Baccelli (École Normale Supérieure - PSL) · 18 shared
- Margarida F. Jacome · 18 shared
- Haris Vikalo (The University of Texas at Austin) · 15 shared
- Sanjay Shakkottai · 15 shared
- Arjun Anand (Delhi Technological University) · 14 shared
- Virag Shah · 13 shared
- Alan C. Bovik · 12 shared
Awards & honors
- 1996 National Science Foundation CAREER Award
- IEEE/CAS William J. McCalla ICCAD Best Paper Award (2000)
- Elevated to IEEE Fellow status in 2009
- 2014 INFOCOM Best Paper Award
- Best Paper at ACM MSWiM 2010