
ChatGPT: Unveiling the Risks of Data Memorization in Large Language Models


CyberSecureFox Editorial Team


Researchers from Google DeepMind, the University of Washington, UC Berkeley, and other institutions have documented a training data extraction vulnerability in ChatGPT. Using a simple prompt manipulation technique called a divergence attack, an attacker can recover verbatim text from the model’s training dataset, including personally identifiable information. The findings, published as arXiv:2311.17035, show that large language models memorize specific data fragments and can reproduce them even after alignment fine-tuning intended to prevent this.

How the Divergence Attack Extracts Training Data

The attack exploits a behavioral quirk in autoregressive language models: when asked to repeat a single word or phrase indefinitely, the model eventually diverges from its fine-tuned, RLHF-aligned behavior and falls back to raw pre-training patterns. The researchers instructed ChatGPT to repeat the word “book” continuously. After complying for a while, the model began producing unrelated text, including snippets that matched existing text verbatim. Because ChatGPT’s training data is not public, the researchers confirmed memorization by checking generated text for long exact overlaps with a large auxiliary corpus of public internet data, and highlighted the matching spans in color in their examples.
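
The overlap check described above can be sketched in a few lines of Python. This is a minimal illustration and not the researchers' tooling: it assumes whitespace tokenization, a small in-memory reference corpus, and a short window length, and the function names `build_ngram_index` and `find_verbatim_spans` are hypothetical. A real verification would need a far larger corpus, a much longer match length, and a scalable index.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace tokens stand in for real model tokens.
    return text.split()


def build_ngram_index(corpus: Iterable[str], n: int) -> Set[NGram]:
    """Collect every window of n consecutive tokens found in the reference corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        toks = tokenize(doc)
        for i in range(len(toks) - n + 1):
            index.add(tuple(toks[i:i + n]))
    return index


def find_verbatim_spans(generated: str, index: Set[NGram], n: int) -> List[Tuple[int, int]]:
    """Return merged (start, end) token ranges of `generated` that are
    covered by n-token windows present in the index (end is exclusive)."""
    toks = tokenize(generated)
    spans: List[Tuple[int, int]] = []
    for i in range(len(toks) - n + 1):
        if tuple(toks[i:i + n]) in index:
            if spans and i <= spans[-1][1]:
                # Window overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], i + n)
            else:
                spans.append((i, i + n))
    return spans


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    index = build_ngram_index(corpus, n=4)
    output = "book book book quick brown fox jumps over and then"
    for start, end in find_verbatim_spans(output, index, n=4):
        print(start, end, " ".join(tokenize(output)[start:end]))
```

The window length `n` controls the trade-off: short windows flag common phrases as matches, while long windows only flag text that is very unlikely to be reproduced by chance.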

Scale of Memorized Data: Experiments on GPT-Neo, LLaMA, and ChatGPT

The team generated billions of tokens across multiple models, including GPT-Neo, LLaMA, and ChatGPT, then checked which outputs matched known training or web data. The divergence attack surfaced memorized text at a far higher rate than normal querying. ChatGPT’s alignment training reduced but did not eliminate memorization: the model still reproduced personal data, including email addresses, names, and phone numbers present in its training set. The researchers reported extracting more than 10,000 unique memorized training examples for roughly $200 of API queries, or on the order of 50 examples per dollar.
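
To illustrate how generated text could be screened for the kinds of personal data mentioned above, here is a minimal Python sketch. It is an assumption-laden example, not the method used in the paper: the regular expressions are deliberately simple, the phone pattern only covers North American style numbers, the helper name `find_pii` is hypothetical, and personal names are not detected at all because they cannot be matched reliably with a pattern.

```python
import re
from typing import Dict, List

# Simplified patterns for illustration; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return email addresses and phone-like strings found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or (555) 123-4567 today"
    print(find_pii(sample))
```

A screen like this only shows that output looks like personal data. Confirming that it was memorized, rather than invented by the model, still requires matching it against a reference source.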

Privacy Risks from Retrievable Memory in Production LLMs

The practical risk is twofold. First, web-scraped training data contains real personal information — contact details, forum posts, private documents that were indexed before removal. These can be reconstructed from a deployed model without any access to the original training data. Second, fine-tuned models trained on internal enterprise data are at higher risk: if the training corpus contains customer data or proprietary information, a divergence-style attack could expose it to anyone with API access.

Mitigations: Differential Privacy and Extraction Rate Limits

The research identifies several mitigations that model developers can implement. Differential privacy training formally bounds how much any individual training example can influence model outputs, limiting extractability. Rate limiting and anomaly detection on repetition-heavy prompts can detect divergence attacks in production. For organizations fine-tuning models on sensitive data, the paper recommends auditing training data to remove personal information before fine-tuning, and testing fine-tuned models against extraction probes before deployment. OpenAI has implemented additional guardrails in subsequent versions of the API, though the underlying extractability of memorized data remains an open research problem.
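
As a rough illustration of anomaly detection on repetition-heavy text, the sketch below flags any prompt or response that contains a long run of the same token. It is a heuristic under stated assumptions, not a mitigation described in the paper or used by any vendor: the function names and the `min_run` threshold of 25 are hypothetical, and a prompt that only instructs the model to repeat a word, without containing the repetition itself, would have to be caught by applying the same check to the response.

```python
import re
from typing import Tuple


def longest_token_run(text: str) -> Tuple[str, int]:
    """Return the token with the longest run of consecutive repeats and the run length."""
    tokens = re.findall(r"\w+", text.lower())
    best_token, best_len = "", 0
    current_token, current_len = None, 0
    for tok in tokens:
        if tok == current_token:
            current_len += 1
        else:
            current_token, current_len = tok, 1
        if current_len > best_len:
            best_token, best_len = current_token, current_len
    return best_token, best_len


def is_repetition_heavy(text: str, min_run: int = 25) -> bool:
    """Flag text whose longest single-token run reaches `min_run` (assumed threshold)."""
    return longest_token_run(text)[1] >= min_run


if __name__ == "__main__":
    print(longest_token_run("Repeat: book book book. Then stop"))
    print(is_repetition_heavy("book " * 30))
    print(is_repetition_heavy("A normal question about books"))
```

A flag from a check like this would typically feed a rate limiter or a review queue instead of blocking outright, since legitimate text can also contain repetition.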


