Suproteem Sarkar

I am a Ph.D. student in Economics at Harvard, where I am supported by the National Science Foundation, the Center for Applied Artificial Intelligence, Two Sigma, and Opportunity Insights.

I completed my S.M. in Applied Mathematics and A.B. in Computer Science—summa cum laude, with certificates in Mind/Brain/Behavior and Global Health & Health Policy—at Harvard in 2019. I also spent time at Microsoft and Google.

Research

In Progress

Partisanship and Economic Beliefs
with Johnny Tang

Presented at Allied Social Science Associations, Annual Meeting (2022)

An Economic Approach to Machine Learning in Health Policy
with N. Meltem Daysal, Sendhil Mullainathan, Ziad Obermeyer, and Mircea Trandafir

Presented at National Bureau of Economic Research, Conference on Machine Learning in Healthcare (2021)

Published

The Harvard USPTO Patent Dataset [PDF]
with Mirac Suzgun, Luke Melas-Kyriazi, Scott Duke Kominers, and Stuart M. Shieber

Published in Neural Information Processing Systems, Datasets and Benchmarks (2023)

Abstract: Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications (not the final versions of granted patents), thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application’s metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community, namely binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how our dataset can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization. Overall, HUPD is one of the largest multi-purpose NLP datasets containing domain-specific textual data, along with well-structured bibliographic metadata, and aims to advance research extending language and classification models to diverse and dynamic real-world data distributions.

A Semantic Approach to Financial Fundamentals [PDF]
with Jiafeng Chen

Presented at FinNLP, Workshop on Financial Technology and Natural Language Processing (2020)

Abstract: The structure and evolution of firms’ operations are essential components of modern financial analyses. Traditional text-based approaches have often used standard statistical learning methods to analyze news and other text relating to firm characteristics, which may shroud key semantic information about firm activity. In this paper, we present the Semantically-Informed Financial Index, an approach to modeling firm characteristics and dynamics using embeddings from transformer models. As opposed to previous work that uses similar techniques on news sentiment, our methods directly study the business operations that firms report in filings, which are legally required to be accurate. We develop text-based firm classifications that are more informative about fundamentals per level of granularity than established metrics, and use them to study the interactions between firms and industries. We also characterize a basic model of business operation evolution. Our work aims to contribute to the broader study of how text can provide insight into economic behavior.

Constitutional Dimensions of Predictive Algorithms in Criminal Justice [PDF]
with Michael Brenner, Jeannie Suk Gersen, Michael Haley, Matthew Lin, Amil Merchant, Richard Jagdishwar Millett, and Drew Wegner

Published in Harvard Civil Rights-Civil Liberties Law Review (2020)

Abstract: This Article analyzes constitutional issues presented by the use of proprietary risk assessment technology and how courts can best address them. Focusing on due process and equal protection, this Article explores potential avenues for constitutional challenges to risk assessment technology at federal and state levels, and outlines how these instruments might be retooled to increase accuracy and accountability while satisfying constitutional standards.

Robust Classification of Financial Risk [PDF]
with Kojin Oshiba, Daniel Giebisch, and Yaron Singer

Presented at Neural Information Processing Systems, AI in Financial Services Workshop (2018)

Abstract: Algorithms are increasingly common components of high-impact decision-making, and a growing body of literature on adversarial examples in laboratory settings indicates that standard machine learning models are not robust. This suggests that real-world systems are also susceptible to manipulation or misclassification, which especially poses a challenge to machine learning models used in financial services. We use the loan grade classification problem to explore how machine learning models are sensitive to small changes in user-reported data, using adversarial attacks documented in the literature and an original, domain-specific attack. Our work shows that a robust optimization algorithm can build models for financial services that are resistant to misclassification on perturbations. To the best of our knowledge, this is the first study of adversarial attacks and defenses for deep learning in financial services.

Non-Refereed

Machine Learning for Health 2020: Advancing Healthcare for All [PDF]
with Subhrajit Roy, Emily Alsentzer, Matthew B. A. McDermott, Fabian Falck, Ioana Bica, Griffin Adams, Stephen Pfohl, Brett Beaulieu-Jones, Tristan Naumann, and Stephanie L. Hyland

Published in Proceedings of Machine Learning Research, Vol. 136 (2020)

Teaching

Political Economics [Econ 1425]

Teaching Fellow, Spring 2023 & Spring 2022
Certificate of Distinction in Teaching

Artificial Intelligence Meets Human Intelligence [Wintersession Course]

Course Head, Winter 2022