Suproteem Sarkar

Economics & Computer Science at Harvard

I am a Ph.D. student in Economics at Harvard, where I am supported by the NSF and Opportunity Insights.

I received my S.M. in Applied Mathematics and A.B. in Computer Science (summa cum laude, with certificates in Mind/Brain/Behavior and Global Health & Health Policy) from Harvard in 2019. I have also spent time at Microsoft and Google.


Published and Forthcoming

Constitutional Dimensions of Predictive Algorithms in Criminal Justice [PDF]
with Michael Brenner, Jeannie Suk Gersen, Michael Haley, Matthew Lin, Amil Merchant, Richard Jagdishwar Millett, and Drew Wegner

Published in Harvard Civil Rights-Civil Liberties Law Review (2020)

Abstract This Article analyzes constitutional issues presented by the use of proprietary risk assessment technology and how courts can best address them. Focusing on due process and equal protection, this Article explores potential avenues for constitutional challenges to risk assessment technology at federal and state levels, and outlines how these instruments might be retooled to increase accuracy and accountability while satisfying constitutional standards.


A Semantic Approach to Financial Fundamentals [PDF]
with Jiafeng Chen

Presented at FinNLP, Workshop on Financial Technology and Natural Language Processing (2020)

Abstract The structure and evolution of firms’ operations are essential components of modern financial analyses. Traditional text-based approaches have often used standard statistical learning methods to analyze news and other text relating to firm characteristics, which may shroud key semantic information about firm activity. In this paper, we present the Semantically-Informed Financial Index, an approach to modeling firm characteristics and dynamics using embeddings from transformer models. As opposed to previous work that uses similar techniques on news sentiment, our methods directly study the business operations that firms report in filings, which are legally required to be accurate. We develop text-based firm classifications that are more informative about fundamentals per level of granularity than established metrics, and use them to study the interactions between firms and industries. We also characterize a basic model of business operation evolution. Our work aims to contribute to the broader study of how text can provide insight into economic behavior.


Universal Causal Evaluation Engine [PDF]
with Alexander Lin * Amil Merchant and Alexander D’Amour

Presented at ACM KDD, Causal Discovery Workshop (2019)
Published in Proceedings of Machine Learning Research, Vol. 104

Abstract A major driver in the success of predictive machine learning has been the "common task framework," where community-wide benchmarks are shared for evaluating new algorithms. This pattern, however, is difficult to implement for causal learning tasks because the ground truth in these tasks is in general unobservable. Instead, causal inference methods are often evaluated on synthetic or semi-synthetic datasets that incorporate idiosyncratic assumptions about the underlying data-generating process. These evaluations are often proposed in conjunction with new causal inference methods—as a result, many methods are evaluated on incomparable benchmarks. To address this issue, we establish an API for generalized causal inference model assessment, with the goal of developing a platform that lets researchers deploy and evaluate new model classes in instances where treatments are explicitly known. The API uses a common interface for each of its components, and it allows for new methods and datasets to be evaluated and saved for future benchmarking.


Robust Classification of Financial Risk [PDF]
with Kojin Oshiba * Daniel Giebisch and Yaron Singer

Presented at Neural Information Processing Systems, AI in Financial Services Workshop (2018)

Abstract Algorithms are increasingly common components of high-impact decision-making, and a growing body of literature on adversarial examples in laboratory settings indicates that standard machine learning models are not robust. This suggests that real-world systems are also susceptible to manipulation or misclassification, which especially poses a challenge to machine learning models used in financial services. We use the loan grade classification problem to explore how machine learning models are sensitive to small changes in user-reported data, using adversarial attacks documented in the literature and an original, domain-specific attack. Our work shows that a robust optimization algorithm can build models for financial services that are resistant to misclassification on perturbations. To the best of our knowledge, this is the first study of adversarial attacks and defenses for deep learning in financial services.


In Progress

A Machine Learning Approach to Breast Cancer Screening
with N. Meltem Daysal, Sendhil Mullainathan, Ziad Obermeyer, and Mircea Trandafir

Presented at National Bureau of Economic Research, Conference on Machine Learning in Healthcare (2021)

Abstract We use machine learning to predict the value of preventive cancer screens and discuss how our results might inform screening policies. As the ultimate value of a preventive intervention is often tied to underlying health risks, machine predictions can help individualize value measures that are typically compared using averages. Focusing on preventive mammography, which aims to find breast cancers when they are easy to treat, we train and evaluate machine learning models for predicting cancer risk. On Danish administrative data, we find that reallocating cancer screens using these models, compared to policies that raise minimum invitation ages, could catch more tumors or reduce overall screening. Leveraging the staggered rollout of invited mammography programs across Denmark, we also find that screening high-risk women could diminish the long-term incidence of large tumors—while reducing the overdiagnosis of cancers that would likely not become deadly. We find that label choice is key in this setting—training models to predict cancer-related health outcomes can better identify those for whom screening is of high value than training models to predict positive mammography results.



Machine Learning for Health 2020: Advancing Healthcare for All [PDF]
with Subhrajit Roy, Emily Alsentzer, Matthew B. A. McDermott, Fabian Falck, Ioana Bica, Griffin Adams, Stephen Pfohl, Brett Beaulieu-Jones, Tristan Naumann, and Stephanie L. Hyland

Published in Proceedings of Machine Learning Research, Vol. 136 (2020)
