Projects

Research Direction: Evaluations, legal alignment, and other technical governance topics

I am interested in working on projects in the following areas. Most plausibly I would be interested in running a group project on science of evaluations or on legal alignment, though the specific project will depend on the group's interests. I am also open to supervising individual governance projects.

Science of evaluations: How do we build better evaluations for alignment, capabilities, and beyond? Some specific projects I would be interested in:

  • Can we systematically assess the effect and sources of eval awareness for alignment and capability evals? E.g., how do various post-training interventions or prompt artifacts affect eval awareness across different benchmarks?

  • How can we assess the ecological validity of benchmarks and evals? E.g., how do we know that our benchmarks generalize to real-world behavior and/or say something meaningful about risk in real-world settings? See here and here.

Legal alignment evaluations: Can we build automated assessments for AI agents' abilities and propensities to comply with legal requirements? See 4.1 and somewhat 4.2 here; for examples of work in this direction, see here and here. Some potential projects include:

  • How well do LLMs comply with model specs/AI constitutions? How do models interpret these documents, and do their interpretations or behaviors differ from human interpretations?

  • Laws can have hierarchies, analogous to instruction hierarchies, eg, a US federal law taking precedent over a state law. Do models respect these hierarchies?

  • When performing tasks that may have legal implications, do models a) recognize that their are legal implications, b) recognize the correct areas of law at issue, and c) correctly retrieve the relevant legal texts (eg from a given database/document)?

  • Many laws are justified by principles of deterrence, the idea that actors may be unwilling to take actions due to legally imposed costs/penalties. What types of penalties could "deter" LLMs, and how do different penalties rank relative to each other?

  • Build an evaluation for AI agents' compliance with the law (for a specific legal doctrine or body of law)

AI governance: I am generally open to supervising AI governance projects, especially projects that (1) require technical understanding to execute, (2) are related to U.S. legal implementation or challenges, or (3) use quantitative or empirical social science methods to answer questions relevant for frontier AI governance (e.g., survey methods). I am particularly interested in eval-related policy. Some specific questions could be:

  • What are the effects of inference scaling on AI governance? How do inference scaling, scaffolding, and multi-agent systems change existing governance paradigms?

  • How can we build out a robust eval ecosystem?

  • What is the legal effect of evals under current law (e.g., as evidence in a tort lawsuit)?

  • How can we verify eval results and implementation, and how do we think about eval certification across different countries?

I am also open to supervising topics in Chinese AI governance or US–China relations, but would prefer to supervise projects in this area only if they relate to (1) through (3) above.

What I'm looking for in a Mentee

Candidates interested in science of evals or evals projects should be familiar with Python and have previous research experience in machine learning. Familiarity with statistical modelling/causal inference and with evaluation frameworks such as Inspect are a plus (but not required).

Candidates interested in legal alignment should have or be in the process of completing a law degree (JD, LLB, LLM, etc).

Candidates interested in governance projects should have prior experience in public policy (research, advocacy, etc. in any field, including but not limited to AI policy).

All candidates should have an interest in AI safety and a desire to publish an academic paper, blog post/technical blog post, or policy report.

What I'm Like as a Mentor

I am typically relatively hands-off and will provide only high-level guidance, but depending on the projects chosen this summer, I may be more involved. I generally prefer to work async and appreciate planning out projects ahead + having a good sense of theory of change before starting a project.

Bio

Kevin Wei (he/they) is an adjunct in RAND's Center for AI, Security, and Technology. They were previously a Fellow at RAND and a visiting researcher on the UK AI Security Institute's science of evaluations team. Kevin's work has been published in ICML, AIES, FAccT, TMLR, AAAI, and other venues. Kevin has a Master's in Global Affairs from Tsinghua University, an M.S. in Machine Learning from Georgia Tech, and a B.A. in Mathematics-Statistics & Economics from Columbia.

Previous
Previous

Erich Grunewald

Next
Next

Damiano Fornasiere & Gaël Gendron