Sankaran Vaidyanathan

I am a PhD student at the College of Information and Computer Sciences, UMass Amherst, where I am advised by David Jensen. My research spans the areas of causal inference, probabilistic machine learning, and reinforcement learning. More recently, I have also worked on approaches for mechanistic interpretability in LLMs.

I aim to create tools for analyzing and evaluating the behaviour of complex AI systems, with a focus on problems in blame and responsibility attribution, explainability, and alignment with human norms. Unlike most applications of causal inference, which involve objective experimentation and interaction with the external world, these problems are traditionally grounded in subjective human judgments. Such judgments involve norms that can be very counterintuitive and that pose a significant challenge to purely statistical approaches to causal inference. By developing formal approaches for modeling norms and inference algorithms that align with them, I hope to support open and scientific evaluation and auditing of AI systems, and the growth of AI systems that better align with human norms.

selected publications

arXiv
Automated Discovery of Functional Actual Causes in Complex Environments

Caleb Chuck^*, Sankaran Vaidyanathan^*, Stephen Giguere, and 3 more authors

arXiv preprint arXiv:2404.10883, 2024

Reinforcement learning (RL) algorithms often struggle to learn policies that generalize to novel situations due to issues such as causal confusion, overfitting to irrelevant factors, and failure to isolate control of state factors. These issues stem from a common source: a failure to accurately identify and exploit state-specific causal relationships in the environment. While some prior works in RL aim to identify these relationships explicitly, they rely on informal domain-specific heuristics such as spatial and temporal proximity. Actual causality offers a principled and general framework for determining the causes of particular events. However, existing definitions of actual cause often attribute causality to a large number of events, even if many of them rarely influence the outcome. Prior work on actual causality proposes normality as a solution to this problem, but its existing implementations are challenging to scale to complex and continuous-valued RL environments. This paper introduces functional actual cause (FAC), a framework that uses context-specific independencies in the environment to restrict the set of actual causes. We additionally introduce Joint Optimization for Actual Cause Inference (JACI), an algorithm that learns from observational data to infer functional actual causes. We demonstrate empirically that FAC agrees with known results on a suite of examples from the actual causality literature, and JACI identifies actual causes with significantly higher accuracy than existing heuristic methods in a set of complex, continuous-valued environments.
@article{chuck2024automated,
  title = {Automated Discovery of Functional Actual Causes in Complex Environments},
  author = {Chuck, Caleb and Vaidyanathan, Sankaran and Giguere, Stephen and Zhang, Amy and Jensen, David and Niekum, Scott},
  journal = {arXiv preprint arXiv:2404.10883},
  year = {2024},
}

arXiv
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur^*, Kartik Choudhary^*, Venkat Srinik Ramayapally^*, and 2 more authors

arXiv preprint arXiv:2406.12624, 2024

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges, focusing on a clean scenario in which inter-human agreement is high. Investigating thirteen judge models of different model sizes and families, judging answers of nine different 'exam-taker models', both base and instruction-tuned, we find that only the best (and largest) models achieve reasonable alignment with humans. However, they are still quite far behind inter-human agreement, and their assigned scores may still differ by up to 5 points from human-assigned scores. In terms of their ranking of the nine exam-taker models, however, smaller models and even the lexical metric 'contains' may also provide a reasonable signal. Through error analysis and other studies, we identify vulnerabilities in judge models, such as their sensitivity to prompt complexity and length, and a tendency toward leniency. The fact that even the best judges differ from humans in this comparatively simple setup suggests that caution may be wise when using judges in more complex setups. Lastly, our research rediscovers the importance of using alignment metrics beyond simple percent alignment, showing that judges with high percent agreement can still assign vastly different scores.
@article{thakur2024judging,
  title = {Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges},
  author = {Thakur, Aman Singh and Choudhary, Kartik and Ramayapally, Venkat Srinik and Vaidyanathan, Sankaran and Hupkes, Dieuwke},
  journal = {arXiv preprint arXiv:2406.12624},
  year = {2024},
}

arXiv
Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability

Jatin Nainani^*, Sankaran Vaidyanathan^*, AJ Yeung, and 2 more authors

arXiv preprint arXiv:2411.16105, 2024

Mechanistic interpretability aims to understand the inner workings of large neural networks by identifying circuits, or minimal subgraphs within the model that implement algorithms responsible for performing specific tasks. These circuits are typically discovered and analyzed using a narrowly defined prompt format. However, given the abilities of large language models (LLMs) to generalize across various prompt formats for the same task, it remains unclear how well these circuits generalize. For instance, it is unclear whether the model's generalization results from reusing the same circuit components, the components behaving differently, or the use of entirely different components. In this paper, we investigate the generality of the indirect object identification (IOI) circuit in GPT-2 small, which is well-studied and believed to implement a simple, interpretable algorithm. We evaluate its performance on prompt variants that challenge the assumptions of this algorithm. Our findings reveal that the circuit generalizes surprisingly well, reusing all of its components and mechanisms while only adding input edges. Notably, the circuit generalizes even to prompt variants where the original algorithm should fail; we discover a mechanism that explains this, which we term S2 Hacking. Our findings indicate that circuits within LLMs may be more flexible and general than previously recognized, underscoring the importance of studying circuit generalization to better understand the broader capabilities of these models.
@article{nainani2024adaptive,
  title = {Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability},
  author = {Nainani, Jatin and Vaidyanathan, Sankaran and Yeung, AJ and Gupta, Kartik and Jensen, David},
  journal = {arXiv preprint arXiv:2411.16105},
  year = {2024},
}