David Karger

Massachusetts Institute of Technology · Electrical Engineering & Computer Science

Active 1992–2024

h-index: 95

Citations: 53.4k

Papers: 384 (51 in the last 5 years)

Funding: $1.0M

Research topics

Computer Science
Artificial Intelligence
Machine Learning
Psychology
Data Mining
Natural Language Processing
Epistemology
Engineering
Data science
Aerospace engineering
Mathematics
Mathematics education
Combinatorics

Selected publications

New Methods for Confusion Detection in Course Forums: Student, Teacher, and Machine
IEEE Transactions on Learning Technologies · 2021 · 9 citations
Senior author · Corresponding
- Computer Science
- Artificial Intelligence
This article provides computational and rule-based approaches for detecting confusion that is expressed in students' comments in course forums. To obtain reliable, ground truth data about which posts exhibit student confusion, we designed a decision tree that facilitates the manual labeling of forum posts by experts. However, manual labeling is costly in time and resources, which limits the amount of data that can be generated using this process. Our strategy for overcoming these limitations was to generate rules for detecting confusion based on student input via hashtags, which reflect the student's affective states. We show that the resulting rules closely align with the ground truth judgement of experts. We next applied these rules to datasets of students' forum posts in a large-scale biology course, thereby automatically generating thousands of labeled instances of “confused posts.” Finally, the resulting dataset was used to train a machine learning model for detecting whether students' posts exhibit confusion in the absence of hashtags. In this task, the pretrained language model based on bidirectional encoder representation from transformers (BERT) was able to outperform traditional machine learning models for classifying confusion in posts. This model was also able to generalize and detect student confusion across different offerings of the same course. Ultimately, the use of pretrained language models of this type will provide teachers with better technologies for detecting and alleviating confusion in online discussion forums by leveraging the combined input of teachers and students.
Seeding Course Forums using the Teacher-in-the-Loop
2021 · 3 citations
- Computer Science
- Mathematics education
Online forums are an integral part of modern day courses, but motivating students to participate in educationally beneficial discussions can be challenging. Our proposed solution is to initialize (or “seed”) a new course forum with comments from past instances of the same course that are intended to trigger discussion that is beneficial to learning. In this work, we develop methods for selecting high-quality seeds and evaluate their impact over one course instance of a 186-student biology class. We designed a scale for measuring the “seeding suitability” score of a given thread (an opening comment and its ensuing discussion). We then constructed a supervised machine learning (ML) model for predicting the seeding suitability score of a given thread. This model was evaluated in two ways: first, by comparing its performance to the expert opinion of the course instructors on test/holdout data; and second, by embedding it in a live course, where it was actively used to facilitate seeding by the course instructors. For each reading assignment in the course, we presented a ranked list of seeding recommendations to the course instructors, who could review the list and filter out seeds with inconsistent or malformed content. We then ran a randomized controlled study, in which one group of students was shown seeds that were recommended by the ML model, and another group was shown seeds that were recommended by an alternative model that ranked seeds purely by the length of discussion that was generated in previous course instances. We found that the group of students that received posts from either seeding model generated more discussion than a control group in the course that did not get seeded posts. Furthermore, students who received seeds selected by the ML-based model showed higher levels of engagement, as well as greater learning gains, than those who received seeds ranked by length of discussion.
#Confused and beyond
2020 · 13 citations
- Computer Science
- Artificial Intelligence
Students' confusion is a barrier for learning, contributing to loss of motivation and to disengagement with course materials. However, detecting students' confusion in large-scale courses is both time and resource intensive. This paper provides a new approach for confusion detection in online forums that is based on harnessing the power of students' self-reported affective states (reported using a set of pre-defined hashtags). It presents a rule for labeling confusion, based on students' hashtags in their posts, that is shown to align with teachers' judgement. We use this labeling rule to inform the design of an automated classifier for confusion detection for the case when there are no self-reported hashtags present in the test set. We demonstrate this approach in a large scale Biology course using the Nota Bene annotation platform. This work lays the foundation to empower teachers with better support tools for detecting and alleviating confusion in online courses.
ARDA
Proceedings of the VLDB Endowment · 2020 · 63 citations
Senior author · Corresponding
- Computer Science
- Machine Learning
Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.

Recent grants

III-COR: Data Homesteading: Tools to let Scientific Users Harvest, Husband, and Share Structured Information
NSF · $380k · 2007–2010
Applied Algorithms: Tech Transfer from the Algorithms Toolbox
NSF · $250k · 2006–2009
AF: Small: Applied Algorithms: Tech Transfer from the Algorithms Toolbox II
NSF · $400k · 2011–2015

Frequent coauthors

Robert C. Miller
Mayo Clinic in Florida · 31 shared
Michael S. Bernstein
28 shared
Amy X. Zhang
26 shared
David Huynh
Denali Therapeutics (United States) · 25 shared
Dennis Quan
Duke University · 24 shared
m.c. schraefel
19 shared
Max Van Kleek
19 shared
Vineet Sinha
Patna Medical College and Hospital · 17 shared

Education

Ph.D., Computer Science
Stanford University · 1995
A.B., Computer Science
Harvard University · 1989
