Anna Hedström

  • Postdoc
  • anna.hedstroem@ai.ethz.ch
  • OAT X19.1
  • External Website
  • My research explores the intersection of evaluation-centric interpretability and alignment for the control and safety of large language models. I aim to develop principled methods that turn mechanistic understanding of models into signals for steering and post-training, enabling preventive safeguards and the mitigation of emergent misalignment.