
Shourish Chakravarty
Assistant Professor · Southern Piedmont AREC · Virginia Tech · Agricultural and Applied Economics
Active 2016–2021
About
Shourish Chakravarty is an Assistant Professor in the Department of Agricultural and Applied Economics at Virginia Tech. He earned his Ph.D. in Agricultural Economics from the Food and Resource Economics Department at the University of Florida. His core research interests involve applying economic tools to analyze agricultural production, technology adoption, conservation practices, and crop profitability. His primary goal is to deliver timely, relevant, and comprehensive insights on agricultural productivity, profitability, and sustainability to farmers and stakeholders in Virginia. A significant part of his work involves communicating directly with agricultural stakeholders, particularly those working with plant commodities, to understand their challenges and to conduct research that addresses them.
Research topics
- Artificial Intelligence
- Computer Science
- Information Retrieval
- Natural Language Processing
- Law
- Psychology
Selected publications
Aspect Classification for Legal Depositions
arXiv (Cornell University) · 2021
1st author · Corresponding
- Computer Science
- Information Retrieval
ICAIL Doctoral Consortium, Montreal 2019
Artificial Intelligence and Law · 2020-05-26
article
Dialog Acts Classification for Question-Answer Corpora.
2019-01-01 · 7 citations
article · 1st author · Corresponding
Paraplegia in Subarachnoid Hemorrhage Complicated With Coarctation of Aorta: An Unusual Complication
Journal of Neurosurgical Anesthesiology · 2017-08-16 · 1 citation
letter
Department of Neuroanaesthesia and Neurocritical Care, Artemis Hospital, Gurgaon, Haryana, India. The authors have no funding or conflicts of interest to disclose.
Classifying short unstructured data using the Apache Spark platform
ACM/IEEE Joint Conference on Digital Libraries · 2017-06-19 · 3 citations
article
People worldwide use Twitter to post updates about the events that concern them directly or indirectly. Study of these posts can help identify global events and trends of importance. Similarly, E-commerce applications organize their products in a way that can facilitate their management and satisfy the needs and expectations of their customers. However, classifying data such as tweets or product descriptions is still a challenge. These data are described by short texts, containing in their vocabulary abbreviations of sentences, emojis, hashtags, implicit codes, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on these data. In this paper, we describe our use of the Spark platform to implement two classification strategies to process large data collections, where each datum is a short textual description. One of our solutions uses an associative classifier, while the other is based on a multiclass Logistic Regression classifier using Word2Vec as a feature selection and transformation technique. Our associative classifier captures the relationships among words that uniquely identify each class, and Word2Vec captures the semantic and syntactic context of the words. In our experimental evaluation, we compared our solutions, as well as Spark MLlib classifiers. We assessed effectiveness, efficiency, and memory requirements. The results indicate that our solutions are able to effectively classify the millions of data instances composed of thousands of distinct features and classes, found in our digital libraries.
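The associative classifier mentioned in this abstract can be illustrated with a small sketch. This is not the authors' Spark implementation: it is a minimal, single-machine approximation that mines one-word rules of the form `word -> class`, keeps those that meet a minimum support and confidence, and predicts with the highest-confidence matching rule. The function names (`mine_rules`, `predict`), the thresholds, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def mine_rules(docs, labels, min_support=2, min_confidence=0.8):
    """Mine single-word rules word -> (label, confidence) from tokenized short texts."""
    word_count = Counter()   # number of documents containing each word
    pair_count = Counter()   # number of documents containing the word AND carrying the label
    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            word_count[word] += 1
            pair_count[(word, label)] += 1
    rules = {}
    for (word, label), support in pair_count.items():
        confidence = support / word_count[word]
        if support >= min_support and confidence >= min_confidence:
            best = rules.get(word)
            if best is None or confidence > best[1]:
                rules[word] = (label, confidence)
    return rules

def predict(rules, tokens, default=None):
    """Return the label of the highest-confidence rule that fires, or `default`."""
    matches = [rules[w] for w in set(tokens) if w in rules]
    if not matches:
        return default
    # Tie-break on the label name so the result is deterministic.
    return max(matches, key=lambda m: (m[1], m[0]))[0]

# Hypothetical toy collection of short texts.
docs = [
    ["flood", "warning", "river"],
    ["river", "flood", "rescue"],
    ["sale", "shoes", "discount"],
    ["discount", "sale", "today"],
]
labels = ["disaster", "disaster", "shopping", "shopping"]
rules = mine_rules(docs, labels)
```

With these thresholds only words seen in at least two documents of one class become rules, so `predict(rules, ["flood", "today"])` falls back on the `flood` rule, while a text with no rule words returns the default.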
Classifying Short Unstructured Data Using the Apache Spark Platform
2017-06-01 · 3 citations
article
A Large Collection Learning Optimizer Framework
VTechWorks (Virginia Tech) · 2017-06-30
dissertation · Open access · 1st author · Corresponding
Content is generated on the web at an increasing rate. The type of content varies from text on a traditional webpage to text on social media portals (e.g., social network sites and microblogs). One such example of social media is the microblogging site Twitter. Twitter is known for its high level of activity during live events, natural disasters, and events of global importance. Challenges with the data in the Twitter universe include the limit of 140 characters on the text length. Because of this limitation, the vocabulary in the Twitter universe includes short abbreviations of sentences, emojis, hashtags, and other non-standard usage. Consequently, traditional text classification techniques are not very effective on tweets. Fortunately, sophisticated text processing techniques like cleaning, lemmatizing, and removal of stop words and special characters will give us clean text which can be further processed to derive richer word semantic and syntactic relationships using state of the art feature selection techniques like Word2Vec. Machine learning techniques, using word features that capture semantic and context relationships, can be of benefit regarding classification accuracy. Improving text classification results on Twitter data would pave the way to categorize tweets relative to human defined real world events. This would allow diverse stakeholder communities to interactively collect, organize, browse, visualize, analyze, summarize, and explore content and sources related to crises, disasters, human rights, inequality, population growth, resiliency, shootings, sustainability, violence, etc. Having the events classified into different categories would help us study causality and correlations among real world events. To check the efficacy of our classifier, we would compare our experimental results with an Association Rules (AR) classifier. 
This classifier composes its rules around the most discriminating words in the training data. The hierarchy of rules, along with an ability to tune to a support threshold, makes it an effective classifier for scenarios where short text is involved. Traditionally, developing classification systems for these purposes requires a great degree of human intervention. Constantly monitoring new events, and curating training and validation sets, is tedious and time intensive. Significant human capital is required for such annotation endeavors. Also, involved efforts are required to tune the classifier for best performance. Developing and tuning classifiers manually using human intervention would not be a viable option if we are to monitor events and trends in real-time. We want to build a framework that would require very little human intervention to build and choose the best among the available performing classification techniques in our system. Another challenge with classification systems is related to their performance with unseen data. For the classification of tweets, we are continually faced with a situation where a given event contains a certain keyword that is closely related to it. If a classifier, built for a particular event, due to overfitting to what is a biased sample with limited generality, is faced with new tweets with different keywords, accuracy may be reduced. We propose building a system that will use very little training data in the initial iteration and will be augmented with automatically labelled training data from a collection that stores all the incoming tweets. A system that is trained on incoming tweets that are labelled using sophisticated techniques based on rich word vector representation would perform better than a system that is trained on only the initial set of tweets. 
We also propose to use sophisticated deep learning techniques like Convolutional Neural Networks (CNN) that can capture the combination of the words using an n-gram feature representation. Such sophisticated feature representation could account for the instances when the words occur together. We divide our case studies into two phases: preliminary and final case studies. The preliminary case studies focus on selecting the best feature representation and classification methodology out of the AR and the Word2Vec based Logistic Regression classification techniques. The final case studies focus on developing the augmented semi-supervised training methodology and the framework to develop a large collection learning optimizer to generate a highly performant classifier. For our preliminary case studies, we are able to achieve an F1 score of 0.96 that is based on Word2Vec and Logistic Regression. The AR classifier achieved an F1 score of 0.90 on the same data. For our final case studies, we are able to show improvements of F1 score from 0.58 to 0.94 in certain cases based on our augmented training methodology. Overall, we see improvement in using the augmented training methodology on all datasets.
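The augmented training idea described in this abstract (start from a small labelled set, then add automatically labelled tweets from the incoming collection) resembles a self-training loop. The sketch below is only an illustration under stated assumptions: it swaps the dissertation's Word2Vec-based labelling for a simple word-count profile per class, and the names `train`, `classify`, `self_train`, the confidence threshold, and the toy data are hypothetical.

```python
from collections import Counter

def train(docs, labels):
    """Build one word-count profile per class."""
    profiles = {}
    for tokens, label in zip(docs, labels):
        profiles.setdefault(label, Counter()).update(tokens)
    return profiles

def classify(profiles, tokens):
    """Return (label, confidence); confidence is the winner's share of the total overlap."""
    scores = {label: sum(profile[w] for w in tokens) for label, profile in profiles.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known word: refuse to label
    label = max(sorted(scores), key=scores.get)
    return label, scores[label] / total

def self_train(seed_docs, seed_labels, unlabeled, threshold=0.9, max_rounds=5):
    """Repeatedly add confidently auto-labelled texts to the training set."""
    docs, labels = list(seed_docs), list(seed_labels)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        profiles = train(docs, labels)
        remaining, added = [], 0
        for tokens in pool:
            label, confidence = classify(profiles, tokens)
            if label is not None and confidence >= threshold:
                docs.append(tokens)
                labels.append(label)
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:  # nothing new was learned, so stop early
            break
    return train(docs, labels), pool

# Hypothetical toy data: two seed tweets and a small unlabelled stream.
seed_docs = [["flood", "river"], ["sale", "discount"]]
seed_labels = ["disaster", "shopping"]
unlabeled = [["flood", "rescue"], ["rescue", "boat"], ["discount", "shoes"], ["weather"]]

seed_model = train(seed_docs, seed_labels)
model, leftover = self_train(seed_docs, seed_labels, unlabeled)
```

In this toy run the seed model cannot label a text containing only `boat`, whereas the augmented model can, because `rescue` and then `boat` were absorbed over successive rounds. That mirrors the keyword-drift problem the abstract describes, although a confidence threshold alone does not protect a real system from reinforcing its own mistakes.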
The Virginia Tech System at CoNLL-2016 Shared Task on Shallow Discourse Parsing
2016-01-01 · 3 citations
article · Open access
This paper presents the Virginia Tech system that participated in the CoNLL-2016 shared task on shallow discourse parsing. We describe our end-to-end discourse parser that builds on the methods shown to be successful in previous work. The system consists of several components, such that each module performs a specific subtask, and the components are organized in a pipeline fashion. We also present our efforts to improve several components (explicit sense classification and argument boundary identification for explicit and implicit arguments) and present evaluation results. In the closed evaluation, our system obtained an F1 score of 20.27% on the blind test.
CS5604 Fall 2016 Classification Team Final Report
VTechWorks (Virginia Tech) · 2016-12-08
article · Open access · Senior author
Content is generated on the Web at an exponential rate. The type of content varies from text on a traditional webpage to text on social media portals (e.g., social network sites and microblogs). One such example of social media is the microblogging site Twitter. Twitter is known for its high level of activity during live events, natural disasters, and events of global importance. Improving text classification results on Twitter data would pave the way to categorize the tweets into human defined real world events. This would allow diverse stakeholder communities to interactively collect, organize, browse, visualize, analyze, summarize, and explore content and sources related to crises, disasters, human rights, inequality, population growth, resiliency, shootings, sustainability, violence, etc. Challenges with the data in the Twitter universe include that the text length is limited to 140 characters. Because of this limitation, the vocabulary in the Twitter universe has taken its own form of short abbreviations of sentences, emojis, hashtags, and other non-standard usage of written language. Consequently, traditional text classification techniques are not effective on tweets. Sophisticated text processing techniques like cleaning, lemmatizing, and removal of stop words and special characters will give us clean text which can be further processed to derive richer word semantic and syntactic relationships using state of the art feature selection techniques like Word2Vec. Machine learning techniques using word features that capture semantic and context relationships have been shown to give state of the art classification accuracy. To check the efficacy of our classifier, we would compare our experimental results with an association rules (AR) classifier. This classifier composes its rules around the most discriminating words in the training data. 
The hierarchy of rules along with an ability to tune to the support threshold makes it an effective classifier for scenarios where short text is involved. We developed a system where we read the tweets from HBase and write the classification label back after the classification step. We use domain oriented pre-processing on the tweets, and Word2Vec as the feature selection and transformation technique. We use a multi-class Logistic Regression algorithm for our classifier. We are able to achieve an F1 score of 0.96 when classifying a test set of 320 tweets across 9 classes. The AR classifier achieved an F1 score of 0.90 with the same data. Our developed system can classify collections of any size by utilizing a 20 node Hadoop cluster in a parallel fashion, through Spark. Our experiments suggest that the high accuracy score for our classifier can be primarily attributed to the pre-processing and feature selection techniques that we used. Understanding the Twitter universe vocabulary helped us frame the text cleaning and pre-processing rules used to eliminate noise from the text. The Word2Vec feature selection technique helps us capture the word contexts in a low dimensional feature space that results in high classification accuracy and low model training time. Utilizing the Spark framework to execute our classification pipeline in a distributed fashion allows us to classify large collections without running into out-of-memory exceptions.
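The report credits much of its accuracy to text cleaning and pre-processing rules tuned to Twitter vocabulary. Those rules are not listed here, so the following is only a guess at what a minimal cleaning step could look like: it lowercases the tweet, strips URLs, user mentions, and special characters, and drops stop words. The stop-word list, the regular expressions, and the name `clean_tweet` are assumptions, and lemmatizing is deliberately left out to keep the sketch free of third-party libraries.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the report's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "for", "at", "rt"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")

def clean_tweet(text):
    """Return a list of cleaned tokens for one tweet."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # remove links
    text = MENTION_RE.sub(" ", text)   # remove @user mentions
    text = NON_WORD_RE.sub(" ", text)  # remove '#', punctuation, emojis, other symbols
    return [token for token in text.split() if token not in STOP_WORDS]
```

Hashtag words are kept (only the `#` is removed) because they often carry the event keyword. The resulting token lists are the kind of input that a Word2Vec model, followed by a multi-class Logistic Regression classifier, would then consume.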
Frequent coauthors
- Edward A. Fox (Virginia Tech) · 7 shared
- Maanav Mehrotra (Virginia Tech) · 4 shared
- Ilaria Angela Amantea (University of Turin) · 3 shared
- Satvik Chekuri (Virginia Tech) · 3 shared
- Denilson Alves Pereira (Universidade Federal de Lavras) · 2 shared
- Daphne Odekerken · 2 shared
- Raja Venkata Satya Phanindra Chava · 2 shared
- Marie Garin · 2 shared