👾 Watermark Security in Language Models

Categories: llm, papers, security, tokenizer

Author: muhtasham

Published: November 20, 2023

Modified: December 4, 2023

From the pioneering era of Spacewar! on the DEC PDP-1, this post transitions into the contemporary sphere of language model security. Image captured at the Computer History Museum by the author.

In this brief exploration, I delve into the watermarking technique for large language models (LLMs) proposed in A Watermark for Large Language Models. The idea of watermarking LLM output is intriguing in itself, but what captivated my attention even more were the various strategies for attacking these watermarks. Below, I share key insights from this research, with a focus on adversarial attacks.
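
At generation time, the paper’s “soft” watermark hashes the previous token to pseudo-randomly split the vocabulary into a green list (a fraction γ of tokens) and a red list, then adds a small bias δ to the green-list logits before sampling; the detector later counts how many tokens land on their green lists. Below is a minimal toy sketch of that generation step in plain numpy. The function names, the SHA-256 seeding, and the made-up logits are my own illustrative choices, not the paper’s reference implementation.

```python
import hashlib

import numpy as np


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.25) -> np.ndarray:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.permutation(vocab_size)[: int(gamma * vocab_size)]  # "green" token ids


def watermarked_sample(logits: np.ndarray, prev_token: int,
                       gamma: float = 0.25, delta: float = 2.0) -> int:
    """Soft watermark: add delta to green-list logits, then sample as usual."""
    biased = logits.copy()
    biased[green_list(prev_token, logits.shape[0], gamma)] += delta
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(logits.shape[0], p=probs))


# Hypothetical usage with random "model" logits over a 50-token toy vocabulary.
fake_logits = np.random.default_rng(0).normal(size=50)
print(watermarked_sample(fake_logits, prev_token=7))
```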

The Persistent Challenge of Jailbreaking LLMs

In the paper “Jailbroken: How Does LLM Safety Training Fail?”, the authors discuss the inherent vulnerabilities of LLMs to adversarial attacks despite rigorous safety training. Large language models like OpenAI’s GPT-4 and Anthropic’s Claude v1.3, trained for safety and harmlessness, remain susceptible to “jailbreak” attacks designed to elicit undesired behaviors such as harmful content generation or personal information leaks. This vulnerability stems from two fundamental failure modes in safety training:

  • Competing Objectives: The model’s capability and safety objectives conflict, so a cleverly framed prompt can push it to compromise on safe outputs.
  • Mismatched Generalization: A failure in safety training to generalize to certain domains within the model’s pretraining corpus, creating loopholes for adversarial inputs.

Despite extensive safety measures, these models exhibit weaknesses against newly designed attacks, emphasizing the need for safety mechanisms as sophisticated as the model’s capabilities themselves.

Advancing Adversarial Attacks on Aligned LLMs

Further complicating the security landscape, the paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” introduces a new class of adversarial attacks. Unlike earlier attacks, which required significant human ingenuity and tended to be brittle, these are generated automatically and are highly effective. The method appends a specially crafted suffix to a query, optimized to maximize the probability that the model begins with an affirmative response, thereby bypassing the alignment safeguards meant to prevent harmful content generation. Key highlights include:

  • Automated Adversarial Suffix Generation: Combining greedy and gradient-based search over suffix tokens, this approach creates adversarial suffixes automatically, significantly improving over past automatic prompt-generation methods (a simplified version of the search loop is sketched after this list).
  • Transferability Across Models: These attacks are not only effective on the model they were designed for but also transfer successfully to other models, including publicly released LLMs like ChatGPT, Bard, and Claude, as well as open-source models.
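
To make the search concrete, here is a deliberately simplified, self-contained skeleton of that greedy loop. The `toy_loss` function is a stand-in I invented so the snippet runs without a model; in the actual attack the score is the victim model’s loss on a target affirmative response, and token gradients are used to shortlist candidate swaps rather than trying random ones.

```python
import random


def toy_loss(suffix_tokens: list) -> float:
    """Hypothetical stand-in for 'loss of the target affirmative response given
    prompt + adversarial suffix'; the real attack queries the victim model here."""
    return float(sum((t - 42) ** 2 for t in suffix_tokens))


def greedy_coordinate_search(vocab_size: int = 100, suffix_len: int = 8,
                             steps: int = 200, candidates: int = 32) -> list:
    """Each step proposes single-token swaps at one suffix position and keeps the
    swap that lowers the loss (gradient-based candidate shortlisting omitted)."""
    suffix = [random.randrange(vocab_size) for _ in range(suffix_len)]
    best = toy_loss(suffix)
    for _ in range(steps):
        pos = random.randrange(suffix_len)
        for _ in range(candidates):
            candidate = suffix.copy()
            candidate[pos] = random.randrange(vocab_size)
            loss = toy_loss(candidate)
            if loss < best:
                suffix, best = candidate, loss
    return suffix


print(greedy_coordinate_search())  # drifts toward the token ids the toy loss prefers
```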

This advancement in adversarial attacks underscores the ongoing arms race in LLM security, where traditional post hoc repair strategies are proving insufficient against increasingly sophisticated attacks.

Implications for Watermarking in LLMs

The growing sophistication of adversarial attacks presents significant challenges for watermarking LLMs, and robust watermarking techniques that can withstand such attacks are paramount. As we develop these techniques, we must apply the lessons from these papers to ensure that watermarking mechanisms are not only effective but also resilient as the attack landscape evolves.

Threat Model and Adversarial Behavior

Exploring the threat model, we consider the diverse ways adversaries might attempt to manipulate or remove watermarks in machine-generated text, and why they would want to: the motivating scenarios range from operating social media bots and circumventing CAPTCHAs to academic dishonesty.

Parties Involved

Two primary parties are central to this narrative:

  • Model Owner: Provides a text generation API, incorporating watermarks to track text origin.
  • Attacker: Strives to erase watermarks, aware of their presence and possibly even the underlying technology.

Modes of Operation

The interaction occurs in two distinct modes:

  • Public Mode: Attackers have complete knowledge of the hashing scheme, including the watermark key.
  • Private Mode: Attackers know a watermark is present but do not have access to the secret key.

Attack Strategies

We categorize attack strategies into three main types:

  1. Text Insertion Attacks
  2. Text Deletion Attacks
  3. Text Substitution Attacks

Each type carries its own cost in text quality and computational effort for the attacker, but all three ultimately work by reducing the fraction of green-list tokens the detector can count, as the worked example below illustrates.
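
The paper detects the watermark with a one-proportion z-test on the green-token count: with γ the green-list fraction and T the number of tokens scored, z = (|s|_G − γT) / sqrt(T·γ·(1−γ)). The tiny sketch below uses hypothetical counts to show how a substitution-heavy attack drags the statistic under a typical detection threshold; in practice the detector must re-derive each token’s green list with the same hashing scheme before counting.

```python
import math


def watermark_z_score(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
    """One-proportion z-test: z = (|s|_G - gamma*T) / sqrt(T * gamma * (1 - gamma))."""
    expected = gamma * num_tokens
    return (num_green - expected) / math.sqrt(num_tokens * gamma * (1 - gamma))


# Hypothetical 200-token passage with 110 green tokens straight out of the model...
print(round(watermark_z_score(110, 200), 1))  # ~9.8, confidently watermarked
# ...versus the same passage after substitutions leave only 60 green tokens.
print(round(watermark_z_score(60, 200), 1))   # ~1.6, below a typical threshold of z = 4
```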

Specific Attacks and Mitigations

  • Paraphrasing Attacks: Rewording the watermarked text, either manually or automatically with a weaker paraphrasing model.
  • Discreet Alterations: Small changes such as added whitespace or deliberate misspellings can change the hash computation; proper text normalization is crucial in defending against them.
  • Tokenization Attacks: Modifying the text so that the sub-word tokenization of neighboring words changes, which perturbs the tokens the watermark hash is computed over.
  • Homoglyph and Zero-Width Attacks: Using look-alike Unicode characters or invisible zero-width characters to disrupt tokenization; input normalization is the defense (a minimal normalization sketch follows this list).
  • Generative Attacks: Prompting the model to alter its output in a predictable way, such as the “Emoji Attack” (asking the model to emit an emoji after every word and stripping them afterwards, which shifts the tokens the hash is seeded on); this undermines the watermark but requires a capable, instruction-following model.
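
As a defensive illustration, here is a minimal normalization pass a detector might run before tokenizing suspect text. The homoglyph table is a tiny hypothetical sample rather than a complete mapping, and the function name is my own:

```python
import unicodedata

# Tiny, illustrative homoglyph map; a real defense needs a far larger table.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small a  -> Latin a
    "\u0435": "e",  # Cyrillic small ie -> Latin e
    "\u03bf": "o",  # Greek omicron     -> Latin o
}
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}  # zero-width chars and BOM


def normalize_for_detection(text: str) -> str:
    """Canonicalize text before hashing/tokenizing it for watermark detection:
    strip zero-width characters, map known homoglyphs, then apply Unicode NFKC."""
    cleaned = "".join(HOMOGLYPHS.get(ch, ch) for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFKC", cleaned)


print(normalize_for_detection("w\u200batermark c\u0430t"))  # -> "watermark cat"
```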

Conclusion

This brief exploration of watermarking techniques and their vulnerabilities in LLMs only scratches the surface of a much larger topic. For a deeper dive into adversarial attacks on LLMs, Lilian Weng’s blog is an excellent resource. As of 11/20/2023, she remains a pivotal figure at OpenAI, possibly leading the AI Safety team. In light of the recent challenges within OpenAI’s board, where attempts at adversarial tactics were met with unity and loyalty from the team, her insights are particularly relevant and enlightening: Lilian Weng’s Blog.