
Marcus Botacin
Assistant Professor, Computer Science & Engineering · Texas A&M University
Active 2014–2025
About
Marcus Botacin is a Computer Security Researcher with a focus on malware analysis, evasion, and detection, sandbox development, antivirus operation, hardware-assisted security solutions, and reverse engineering.
Research topics
- Computer Science
- Computer Security
- Artificial Intelligence
- Data science
- Machine Learning
- Data Mining
- Software engineering
- Programming language
Selected publications
Towards Explainable Drift Detection and Early Retrain in ML-Based Malware Detection Pipelines
Lecture notes in computer science · 2025-01-01
book-chapter · Senior author
ML-Based Behavioral Malware Detection Is Far From a Solved Problem
2025-04-09 · 3 citations
article · Open access
Malware detection is a ubiquitous application of Machine Learning (ML) in security. In behavioral malware analysis, the detector relies on features extracted from program execution traces. The research literature has focused on detectors trained with features collected from sandbox environments and evaluated on samples also analyzed in a sandbox. However, in deployment, a malware detector at endpoint hosts often must rely on traces captured from endpoint hosts, not from a sandbox. Thus, there is a gap between the literature and real-world needs. We present the first measurement study of the performance of ML-based malware detectors at real-world endpoints. Leveraging a dataset of sandbox traces and a dataset of in-the-wild program traces, we evaluate two scenarios: (i) an endpoint detector trained on sandbox traces (convenient and easy to train), and (ii) an endpoint detector trained on endpoint traces (more challenging to train, since we need to collect telemetry data). We discover a wide gap between the performance as measured using prior evaluation methods in the literature, over 90%, vs. expected performance in endpoint detection, about 20% (scenario (i)) to 50% (scenario (ii)). We characterize the ML challenges that arise in this domain and contribute to this gap, including label noise, distribution shift, and spurious features. Moreover, we show several techniques that achieve 5–30% relative performance improvements over the baselines. Our evidence suggests that applying detectors trained on sandbox data to endpoint detection is challenging. The most promising direction is training detectors directly on endpoint data, which marks a departure from current practice. To promote progress, we will facilitate researchers to perform realistic detector evaluations against our real-world dataset.
ArXiv.org · 2025-04-15
preprint · Open access · Senior author
The large integration of microphones into devices increases the opportunities for Acoustic Side-Channel Attacks (ASCAs), as these can be used to capture keystrokes' audio signals that might reveal sensitive information. However, the current State-Of-The-Art (SOTA) models for ASCAs, including Convolutional Neural Networks (CNNs) and hybrid models, such as CoAtNet, still exhibit limited robustness under realistic noisy conditions. Solving this problem requires either: (i) an increased model's capacity to infer contextual information from longer sequences, allowing the model to learn that an initially noisily typed word is the same as a futurely collected non-noisy word, or (ii) an approach to fix misidentified information from the contexts, as one does not type random words, but the ones that best fit the conversation context. In this paper, we demonstrate that both strategies are viable and complementary solutions for making ASCAs practical. We observed that no existing solution leverages advanced transformer architectures' power for these tasks and propose that: (i) Visual Transformers (VTs) are the candidate solutions for capturing long-term contextual information and (ii) transformer-powered Large Language Models (LLMs) are the candidate solutions to fix the "typos" (mispredictions) the model might make. Thus, we here present the first-of-its-kind approach that integrates VTs and LLMs for ASCAs. We first show that VTs achieve SOTA performance in classifying keystrokes when compared to the previous CNN benchmark. Second, we demonstrate that LLMs can mitigate the impact of real-world noise. Evaluations on the natural sentences revealed that: (i) incorporating LLMs (e.g., GPT-4o) in our ASCA pipeline boosts the performance of error-correction tasks; and (ii) the comparable performance can be attained by a lightweight, fine-tuned smaller LLM (67 times smaller than GPT-4o), using...
Cross-Regional Malware Detection via Model Distilling and Federated Learning
2024-09-29 · 4 citations
article · Open access · 1st author · Corresponding
Machine Learning (ML) is a key part of modern malware detection pipelines, but its application is not straightforward. It involves multiple practical challenges that are frequently unaddressed by the literature works. A key challenge is the heterogeneity of scenarios. Antivirus (AV) companies for instance operate under different performance constraints in the backend and in the endpoint, and with a diversity of datasets according to the country they operate in. In this paper, we evaluate the impact of these heterogeneous aspects by developing a classification pipeline for 3 datasets of 10K malware samples each collected by an AV company in the USA, Brazil, and Japan in the same period. We characterize the different requirements for these datasets and we show that a different number of features is required to reach the optimal detection rate in each scenario. We show that a global model combining the three datasets increases the detection of the three individual datasets. We propose using Federated Learning (FL) to build the global model and a distilling process to generate the local versions. We order the samples temporally to show that although retraining on concept drift detection helps recover the detection rate, only a FL approach can increase the detection rate.
2024-09-29
article · Open access · 1st author · Corresponding
Malware analysis tasks are as fundamental for modern cybersecurity as they are challenging to perform. More than depending on any tool capability, malware analysis tasks depend on human analysts’ abilities, experiences, and practices when using the tools. Academic research has traditionally been focused on producing solutions to overcome malware analysis technical challenges, but are these solutions adopted in practice by malware analysts? Are these solutions useful? If not, how can the academic community improve its practices to foster adoption and cause a greater impact? To answer these questions, we surveyed 21 professional malware analysts working in different companies, from CSIRTs to AV companies, to hear their opinions about existing tools, practices, and the challenges they face in their daily tasks. In 31 questions, we cover a broad range of aspects, from the number of observed malware variants to the use of public sandboxes and the tools the analysts would like to exist to make their lives easier. We aim to bridge the gap between academic developments and malware practices. To do so, on the one hand, we suggest to the analysts the solutions proposed in the literature that could be integrated into their practices. On the other hand, we also point out to the academic community possible future directions to bridge existing development gaps that significantly affect malware analysis practices.
Digital Threats Research and Practice · 2024-10-11 · 1 citation
article · Open access · 1st author · Corresponding
In real life, distinct runs of the same artifact lead to the exploration of different paths, due to either system’s natural randomness or malicious constructions. These variations might completely change execution outcomes (extreme case). Thus, to analyze malware beyond theoretical models, we must consider the execution of multiple paths. The academic literature presents many approaches for multipath analysis (e.g., fuzzing, symbolic, and concolic executions), but it still fails to answer What’s the current state of multipath malware tracing? This work aims to answer this question and also to point out What developments are still required to make them practical? Thus, we present a literature survey and perform experiments to bridge theory and practice. Our results show that (i) natural variation is frequent; (ii) fuzzing helps to discover more paths; (iii) fuzzing can be guided to increase coverage; (iv) forced execution maximizes path discovery rates; (v) pure symbolic execution is impractical, and (vi) concolic execution is promising but still requires further developments.
On the uniqueness of AntiVirus labels: How many labels do we need to fingerprint an AV?
Journal of Computer Virology and Hacking Techniques · 2024-11-22
article · 1st author · Corresponding
2024-01-01 · 1 citation
article · Open access
Towards more realistic evaluations: The impact of label delays in malware detection pipelines
Computers & Security · 2024-09-19 · 5 citations
article · 1st author · Corresponding
Uma Estratégia Dinâmica para a Detecção de Anomalias em Binários WebAssembly (A Dynamic Strategy for Anomaly Detection in WebAssembly Binaries)
2023-09-18
article · Open access
WebAssembly is a low-level binary format that provides a compilation target for high-level languages. Offering more security for Web users through a binary instruction format, WebAssembly is supported by more than 95% of Web browsers. However, the growing use of WebAssembly has raised concerns about its security and its possible malicious use. Since WebAssembly is a low-level instruction format, it becomes essential to identify the purpose of the developed code by extracting its characteristics. The use of WebAssembly for cryptojacking attacks and for obfuscating malicious code is frequently observed. In this context, this work presents a strategy for identifying anomalies in WebAssembly binaries through feature extraction and static analysis. The strategy proposed in this article achieved an F1 score of 99.3%, demonstrating its potential.
Frequent coauthors
- André Grégio · 54 shared
- Paulo Lício de Geus · 39 shared
- Fabrício Ceschin · Georgia Institute of Technology · 14 shared
- Heitor Murilo Gomes · 8 shared
- Lucas Galante · 7 shared
- Daniela S Oliveira · University of Florida · 5 shared
- Ruimin Sun · Florida International University · 4 shared
- Luiz S. Oliveira · 4 shared
Education
- 2021 · Ph.D., Computer Science · Federal University of Paraná (UFPR-Brazil)
- 2017 · M.S., Computer Science · University of Campinas (UNICAMP-Brazil)
- 2015 · B.S., Computer Engineering · University of Campinas (UNICAMP-Brazil)
Awards & honors
- Outstanding Alumnus - DInf/UFPR - 2025
- Top-3 Best PhD Thesis in Security - Brazilian Computer Socie…
- Best PhD Thesis - Informatics Department/UFPR - 2022
- Best Master Dissertation in Security - 1st place - Brazilian…
- Best Master Dissertation - Institute of Computing/UNICAMP -…