Lee Giles
· David Reese Professor of Information Sciences and Technology; Director, The Intelligent Information Systems Research Laboratory; Graduate Faculty, Computer Science and Engineering; Professor (by courtesy), Supply Chain Management · Pennsylvania State University · Social Data Analytics
Active 1994–2025
About
Dr. C. Lee Giles is a professor whose academic legacy includes supervising numerous Ph.D. students and collaborators. His doctoral graduates at Penn State include individuals who have gone on to notable academic careers, some of whom conducted most of their Ph.D. research with him at NEC Research Institute, Princeton, NJ. His research and academic influence are reflected in his extensive network of collaborators and students, with a notable history of working with prominent researchers and institutions. For more detailed information about his collaborators and academic genealogy, references are available on his Google Scholar and DBLP pages.
Research topics
- Computer Science
- Political Science
- Sociology
- Medicine
- Engineering
- Psychology
- Public relations
- Law
Selected publications
ArXiv.org · 2025-05-25
preprint · Open access
Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students, a persistent challenge in education and teacher training. While Large Language Models (LLMs) offer promise as tools to simulate such complex pedagogical environments, current simulation frameworks are limited in two key respects: (1) they often reduce students to static knowledge profiles, and (2) they lack adaptive mechanisms for modeling teachers who evolve their strategies in response to student feedback. To address these gaps, we introduce a novel simulation framework that integrates LLM-based heterogeneous student agents with a self-optimizing teacher agent. The teacher agent's pedagogical policy is dynamically evolved using a genetic algorithm, allowing it to discover and refine effective teaching strategies based on the aggregate performance of diverse learners. In addition, we propose Persona-RAG, a Retrieval Augmented Generation module that enables student agents to retrieve knowledge tailored to their individual learning styles. Persona-RAG preserves the retrieval accuracy of standard RAG baselines while enhancing personalization, an essential factor in modeling realistic educational scenarios. Through extensive experiments, we demonstrate how our framework supports the emergence of distinct and interpretable teaching patterns when interacting with varied student populations. Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in controlled, data-driven environments.
Neurosymbolic Artificial Intelligence · 2025-10-01 · 7 citations
article · Open access
Resolving the dichotomy between the human-like yet constrained reasoning processes of cognitive architectures (CAs) and the broad but often noisy inference behavior of large language models (LLMs) remains a challenging yet exciting pursuit, aimed at enabling reliable machine reasoning capabilities in LLMs. Previous approaches that employ off-the-shelf LLMs in manufacturing decision-making face challenges in complex reasoning tasks, often exhibiting human-level yet unhuman-like behaviors due to insufficient grounding. The present article starts to address this gap by asking whether LLMs can replicate cognition from CAs to make human-like decisions. We introduce cognitive LLMs, which are hybrid decision-making architectures comprised of a CA and an LLM through a knowledge transfer mechanism, LLM-ACTR. Cognitive LLMs extract and embed knowledge of CA’s internal decision-making process as latent neural representations, inject this information into trainable LLM adapter layers, and fine-tune the LLMs for downstream prediction tasks. We find that, after knowledge transfer through LLM-ACTR, the cognitive LLMs offer better representations of human decision-making behaviors on a novel design-for-manufacturing problem, compared to an LLM-only model that employs chain-of-thought. Taken together, the results open up new research directions for equipping LLMs with the necessary knowledge to computationally model and replicate the internal mechanisms of human cognitive decision-making. We release the code and data samples at https://github.com/SiyuWu528/LLM-ACTR.
Multi-LLM Collaborative Caption Generation in Scientific Documents
arXiv (Cornell University) · 2025-01-05
preprint · Open access
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
SoAC and SoACer: A Sector-Based Corpus and LLM-Based Framework for Sectoral Website Classification
2025-08-27
article · Open access
One approach to understanding the vastness and complexity of the web is to categorize websites into sectors that reflect the specific industries or domains in which they operate. However, existing website classification approaches often struggle to handle the noisy, unstructured, and lengthy nature of web content, and current datasets lack a universal sector classification labeling system specifically designed for the web. To address these issues, we introduce SoAC (Sector of Activity Corpus), a large-scale corpus comprising 195,495 websites categorized into 10 broad sectors tailored for web content, which serves as the benchmark for evaluating our proposed classification framework, SoACer (Sector of Activity Classifier). Building on this resource, SoACer is a novel end-to-end classification framework that first fetches website information, then incorporates extractive summarization to condense noisy and lengthy content into a concise representation, and finally employs large language model (LLM) embeddings (Llama3-8B) combined with a classification head to achieve accurate sectoral prediction. Through extensive experiments, including ablation studies and detailed error analysis, we demonstrate that SoACer achieves an overall accuracy of 72.6% on our proposed SoAC dataset. Our ablation study confirms that extractive summarization not only reduces computational overhead but also enhances classification performance, while our error analysis reveals meaningful sector overlaps that underscore the need for multi-label and hierarchical classification frameworks. These findings provide a robust foundation for future exploration of advanced classification techniques that better capture the complex nature of modern website content.
Multi-LLM Collaborative Caption Generation in Scientific Documents
Communications in computer and information science · 2025-01-01 · 3 citations
book-chapter
2025-01-01 · 1 citation
article · Open access
Automated Detection and Analysis of Data Practices Using A Real-World Corpus
arXiv (Cornell University) · 2024-02-16
preprint · Open access
Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.
Value in Health · 2024-12-01
article · Open access · 1st author · Corresponding
arXiv (Cornell University) · 2024-10-04
preprint · Open access
This study investigates the learnability of Recurrent Neural Networks (RNNs) in classifying structured formal languages, focusing on counter and Dyck languages. Traditionally, both first-order (LSTM) and second-order (O2RNN) RNNs have been considered effective for such tasks, primarily based on their theoretical expressiveness within the Chomsky hierarchy. However, our research challenges this notion by demonstrating that RNNs primarily operate as state machines, where their linguistic capabilities are heavily influenced by the precision of their embeddings and the strategies used for sampling negative examples. Our experiments revealed that performance declines significantly as the structural similarity between positive and negative examples increases. Remarkably, even a basic single-layer classifier using RNN embeddings performed better than chance. To evaluate generalization, we trained models on strings up to a length of 40 and tested them on strings from lengths 41 to 500, using 10 unique seeds to ensure statistical robustness. Stability comparisons between LSTM and O2RNN models showed that O2RNNs generally offer greater stability across various scenarios. We further explore the impact of different initialization strategies revealing that our hypothesis is consistent with various RNNs. Overall, this research questions established beliefs about RNNs' computational capabilities, highlighting the importance of data structure and sampling techniques in assessing neural networks' potential for language classification tasks. It emphasizes that stronger constraints on expressivity are crucial for understanding true learnability, as mere expressivity does not capture the essence of learning.
Investigating Symbolic Capabilities of Large Language Models
arXiv (Cornell University) · 2024-05-21
preprint · Open access
Prompting techniques have significantly enhanced the capabilities of Large Language Models (LLMs) across various complex tasks, including reasoning, planning, and solving math word problems. However, most research has predominantly focused on language-based reasoning and word problems, often overlooking the potential of LLMs in handling symbol-based calculations and reasoning. This study aims to bridge this gap by rigorously evaluating LLMs on a series of symbolic tasks, such as addition, multiplication, modulus arithmetic, numerical precision, and symbolic counting. Our analysis encompasses eight LLMs, including four enterprise-grade and four open-source models, of which three have been pre-trained on mathematical tasks. The assessment framework is anchored in Chomsky's Hierarchy, providing a robust measure of the computational abilities of these models. The evaluation employs minimally explained prompts alongside the zero-shot Chain of Thoughts technique, allowing models to navigate the solution process autonomously. The findings reveal a significant decline in LLMs' performance on context-free and context-sensitive symbolic tasks as the complexity, represented by the number of symbols, increases. Notably, even the fine-tuned GPT3.5 exhibits only marginal improvements, mirroring the performance trends observed in other models. Across the board, all models demonstrated a limited generalization ability on these symbol-intensive tasks. This research underscores LLMs' challenges with increasing symbolic complexity and highlights the need for specialized training, memory and architectural adjustments to enhance their proficiency in symbol-based reasoning tasks.
Frequent coauthors
- 1 shared
Anusha Ranganathan
- 1 shared
Matthew Kemp
University of Oxford
- 1 shared
Elizabeth A. Williams
University of Sheffield
- 1 shared
Jose Canchan
- 1 shared
Sean O'Bannon
- 1 shared
Laurence L. Benson
- 1 shared
Stephanie Taylor
Queen Mary University of London
- 1 shared
Matthew Burgess
University of Aberdeen
Education
PhD, Optical Sciences
University of Arizona
MS, Physics
University of Michigan
Awards & honors
- IEEE Pioneer Award
- INNS Gabor Award