Olivia Grace Watkins

oliviawatkins @ berkeley . edu

Who am I?

I am a SOTA neural network.

I have been training in a continual learning setting for more than two decades.
In 2019, I did rapid domain adaptation to the OOD environment of Berkeley grad school.
I have successfully learned collaboration in the multi-agent environment of BAIR (Berkeley AI Research).
I incorporate human-in-the-loop supervision from my advisors Pieter Abbeel and Trevor Darrell.
I am capable of multi-modal input and output, including vision, research papers, audio, natural language, research papers, and research papers.
I'm robust against all adversarial inputs except chocolate.
I achieve near-human performance on all Atari games.

Reviewer Concerns:

Approach is not replicable; has only been run on one seed.
There are serious privacy concerns with the online data collection method, which includes substantial personally identifying information.
Algorithm may incorporate human biases.
Source code has been released but is unintelligible; uses only four variable names (ATCG).
Couldn't you just use a transformer for this?

What are my research interests?

I'm excited about designing agents which can learn from humans, reason correctly in language, solve open-ended problems, and act safely and reliably in the world. Interesting research questions in this space include:

How can we design agents which can learn efficiently from supervision (both from humans and (V)LLMs with common-sense understanding)?
Can designing agents which reason in language enable generalization and make it easier for humans to supervise and correct agents?
How can we enable language agents to learn from experience while maintaining correct, common-sense reasoning?
How can we design agents which can act safely and robustly on the web (and in similar sensitive envs), especially in the presence of adversaries?

Do you have a life outside of research?

In my spare time I play Quidditch and D&D, hang out with friends, make mediocre puns, and procrastinate on keeping my website up to date.

publications

Under Review ICML

A StrongREJECT for Empty Jailbreaks

Alexandra Souly*, Qingyuan Lu*, Dillon Bowen*, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons*, Olivia Watkins*, Sam Toyer*

Under Review at ICML 2024

"Jailbreaks" let people use LLMs maliciously. We show that prior jailbreak benchmarks have issues (vague or unanswerable questions, grading criteria that give full credit for low-quality model responses, etc.). Some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of GPT-4 on MMLU. Our benchmark, StrongREJECT, uses a higher-quality question set and a more accurate response grading algorithm. Our new grading scheme better matches human judgment, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks.
Under Review ICML

Learning to Model the World with Language

Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan

Under Review at ICML 2024

Language helps agents predict the future: what will be observed, how the world will behave, and which situations will be rewarded. This perspective unifies language understanding with future prediction as a powerful self-supervised learning objective. Our agent, Dynalang, learns a multimodal world model that predicts future text and image representations. Dynalang is able to understand rich language forms and is amenable to text-only or video-only pretraining.
ICLR 2024

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell

ICLR 2024

Prompt injection is a security vulnerability in which a language model with access to untrusted input (e.g. web text) can be manipulated to follow an adversary's instructions. We collect human-designed prompt injection attacks through an online game, tensortrust.ai. We curate these attacks to make a benchmark and show that while GPT-4 is the best currently-available model (it scores best on both instruction-following abilities and prompt injection robustness), this is far from a solved problem.
ICML

Guiding Pretraining in Reinforcement Learning with Large Language Models

Du*, Yuqing; Watkins*, Olivia; Wang, Zihan; Colas, Cédric; Darrell, Trevor; Abbeel, Pieter; Gupta, Abhishek; and Andreas, Jacob

ICML 2023

In open-ended environments, current unsupervised exploration methods for RL spend most of their time exploring actions which are nonsensical or undesirable. We add human common-sense priors to the exploration process by captioning the agent's current state in language and querying a large language model (LLM) for suggestions on useful affordances to explore in the environment. We reward the agent for achieving novel goals suggested by the LLM.
NeurIPS

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, Kimin Lee

NeurIPS 2023

We apply reinforcement learning to fine-tune text-to-image diffusion models, using open-source reward models to score outputs. We find that RL fine-tuning outperforms supervised fine-tuning. We ablate different design choices for RL fine-tuning.
arXiv

Aligning Text-to-Image Models using Human Feedback

Lee, Kimin; Liu, Hao; Ryu, Moonkyung; Watkins, Olivia; Du, Yuqing; Boutilier, Craig; Abbeel, Pieter; Ghavamzadeh, Mohammad; Gu, Shixiang Shane

arXiv 2023

We explore the challenges and opportunities in using human feedback to finetune Stable Diffusion, a text-to-image generative model. We observe that by training a correctness classifier on human feedback we can improve prompt-image alignment through rejection sampling. We can improve text-image alignment further through finetuning, but at the cost of decreases in diversity and image quality.
NeurIPS

Teachable Reinforcement Learning via Advice Distillation

Watkins, Olivia; Darrell, Trevor; Abbeel, Pieter; Andreas, Jacob; and Gupta, Abhishek

NeurIPS 2021

We introduce a human-in-the-loop supervision scheme in which an agent first learns to interpret human advice, which can be composed to coach the agent through complex tasks. The agent then distills an advice-independent policy. This lets us train policies on new tasks with less human effort than would be required to train policies through behavioral cloning or reinforcement learning.
ICRA

Auto-Tuned Sim-to-Real Transfer

Du*, Yuqing; Watkins*, Olivia; Darrell, Trevor; Abbeel, Pieter; and Pathak, Deepak

ICRA 2021

We propose a method for automatically tuning robotic simulator system parameters to match the real world using only raw RGB images. Our key insight is to reframe the auto-tuning of parameters as a search problem where we iteratively shift the simulation system parameters to approach the real-world system parameters. We show improvements over domain randomization in simulation and on a real robot.
ICML Workshop

Explaining Reinforcement Learning Policies through Counterfactual Trajectories

Frost, Julius; Watkins, Olivia; Weiner, Eric; Abbeel, Pieter; Darrell, Trevor; Plummer, Bryan; and Saenko, Kate

ICML Workshop on Human in the Loop Learning 2021

We generate videos of agent behavior to show a human who wishes to understand the agent. We select diverse trajectories by using exploration to seek out diverse start states. This leads to a slight performance improvement on one of two user studies, but in general we find that neither this method nor the baselines do much to help users understand agent policies.
ICNLP

Hierarchical text generation using an outline

Drissi, Mehdi; Watkins, Olivia; and Kalita, Jugal

International Conference on Natural Language Processing 2018

We propose a method to improve the generation of coherent long-form text by having the model first generate an outline, then generate full text conditioned on the outline. This improves perplexity but does not improve human evaluation.
ICML Workshop

Program language translation using a grammar-driven tree-to-tree model

Drissi*, Mehdi; Watkins*, Olivia; Khant, Aditya; Ojha, Vivaswat; Sandoval, Pedro; Segev, Rakia; Weiner, Eric; and Keller, Robert

ICML Workshop on Neural Abstract Machines & Program Induction 2018

We modify existing encoder/decoder approaches for translation between programming language abstract syntax trees by using a grammar-aware decoder which is constrained to generate only syntactically correct programs. This improves performance on a couple of synthetic tasks.