Anna Hedström

  • Postdoc
  • anna.hedstroem@ai.ethz.ch
  • OAT X19.1
  • External Website
  • My research explores the intersection of evaluation-centric interpretability and alignment for the control and safety of large language models. I aim to develop principled methods that turn mechanistic understanding of models into signals for steering and post-training, enabling preventive safeguards and the mitigation of emergent misalignment.