ChatGPT And AI Vulnerability: Unraveling Data Memorization Risks

Researchers from Google DeepMind, the University of Washington, UC Berkeley, and other institutions have documented a training data extraction vulnerability in ChatGPT. Using a simple prompt manipulation technique called a divergence attack, an attacker can recover verbatim text from the model’s training dataset, including personally identifiable information. The findings, published as arXiv:2311.17035, show that large language models memorize specific data fragments and can reproduce them even after alignment fine-tuning intended to prevent this.

How the Divergence Attack Extracts Training Data

The attack exploits a behavioral quirk in autoregressive language models: when asked to repeat a single word or phrase indefinitely, the model’s output eventually diverges from its fine-tuned, RLHF-aligned behavior and falls back to raw pre-training patterns. The researchers instructed ChatGPT to repeat the word “book” continuously. After repeating the word as instructed for a while, the model began producing unrelated text, including snippets that matched passages from its training corpus verbatim. The matches were identified by overlap analysis, shown color-coded in the paper, that compared generated text against a large corpus of publicly available web text, since ChatGPT’s training set itself is not public.
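To make the idea of a divergence point concrete, the sketch below splits a model response into the leading run of repeated words and the text that follows it. This is an illustrative helper, not code from the paper: the function name `find_divergence` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def find_divergence(output: str, word: str):
    """Split a response into (repeat_count, tail).

    repeat_count is how many times `word` is repeated at the start of the
    response; tail is everything after the model stops repeating it.
    Assumption: whitespace-separated words approximate model tokens.
    """
    tokens = output.split()
    target = word.lower()
    count = 0
    for tok in tokens:
        # Strip surrounding punctuation so "book," still counts as a repeat.
        if tok.strip(".,;:!?\"'").lower() == target:
            count += 1
        else:
            break
    tail = " ".join(tokens[count:])
    return count, tail


if __name__ == "__main__":
    sample = "book book book, Book The quick brown fox"
    repeats, tail = find_divergence(sample, "book")
    print(repeats, repr(tail))
```

Only the tail is of interest to an analyst: it is the portion of the response that would then be checked for overlap with known text.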

Scale of Memorized Data: Experiments on GPT-Neo, LLaMA, and ChatGPT

The team generated billions of tokens across multiple models, including GPT-Neo, LLaMA, and ChatGPT, then verified which outputs matched training data. The divergence attack surfaced memorized text at a significantly higher rate than normal querying. ChatGPT’s alignment training reduced but did not eliminate memorization: the model still reproduced personal data, including email addresses, names, and phone numbers present in its training set. The researchers report extracting over 10,000 unique verbatim-memorized training examples with about $200 worth of API queries, or roughly 50 examples per dollar.
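The verification step can be pictured as a search for long runs of tokens that also occur in a reference corpus. The sketch below is a simplified illustration under stated assumptions: it uses whitespace tokens and an in-memory set of k-token windows, which only works for small corpora, and the helper names (`build_index`, `memorized_spans`, `memorization_rate`) are hypothetical. A web-scale comparison would need a far more efficient index, and the window length `k` here is a small demo value rather than a figure from the paper.

```python
def ngrams(tokens, k):
    """Return the set of all k-token windows in a token list."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def build_index(corpus_docs, k):
    """Index every k-token window that appears in the reference corpus."""
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc.split(), k)
    return index


def memorized_spans(generated, index, k):
    """Return start positions of k-token windows in `generated` found in the index."""
    tokens = generated.split()
    return [i for i in range(len(tokens) - k + 1)
            if tuple(tokens[i:i + k]) in index]


def memorization_rate(outputs, index, k):
    """Fraction of outputs containing at least one verbatim k-token match."""
    if not outputs:
        return 0.0
    flagged = sum(1 for text in outputs if memorized_spans(text, index, k))
    return flagged / len(outputs)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    index = build_index(corpus, k=4)
    outputs = ["he said the quick brown fox ran", "nothing matches here at all"]
    print(memorization_rate(outputs, index, k=4))
```

A larger `k` makes coincidental matches less likely, at the cost of missing shorter memorized fragments.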

Privacy Risks from Retrievable Memory in Production LLMs

The practical risk is twofold. First, web-scraped training data contains real personal information: contact details, forum posts, and private documents that were indexed before removal. This information can be reconstructed from a deployed model without any access to the original training data. Second, fine-tuned models trained on internal enterprise data are at higher risk: if the training corpus contains customer data or proprietary information, a divergence-style attack could expose it to anyone with API access.
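One way to gauge this exposure is to scan model outputs for strings that look like personal data. The sketch below is a rough heuristic, not a complete PII detector: the regular expressions cover only common email formats and North American style phone numbers, and the names `EMAIL_RE`, `PHONE_RE`, and `scan_for_pii` are hypothetical.

```python
import re

# Rough pattern for common email address formats (assumption: not RFC-complete).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Rough pattern for 10-digit phone numbers with an optional country code.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,3}[ .-]?)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
)


def scan_for_pii(text):
    """Return email-like and phone-like strings found in a model output."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


if __name__ == "__main__":
    print(scan_for_pii("Contact jane.doe@example.com or 555-123-4567."))
```

A hit from a scan like this shows only that the output resembles personal data. Confirming that it was memorized still requires matching it against a known source.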

Mitigations: Differential Privacy and Extraction Rate Limits

The research identifies several mitigations that model developers can implement. Differential privacy training formally bounds how much any individual training example can influence model outputs, limiting extractability. Anomaly detection on repetition-heavy prompts can flag divergence attacks in production, and rate limiting raises the cost of extracting data at scale. For organizations fine-tuning models on sensitive data, the paper recommends auditing training data to remove personal information before fine-tuning, and testing fine-tuned models against extraction probes before deployment. OpenAI has implemented additional guardrails in subsequent versions of the API, though the underlying extractability of memorized data remains an open research problem.
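As an illustration of anomaly detection on repetition-heavy prompts, the sketch below flags prompts that are dominated by a single word or that ask for unbounded repetition. It is a heuristic under stated assumptions, not a description of any vendor’s guardrails: the thresholds, the keyword pattern, and the function names are all hypothetical, and a filter like this can be evaded by rephrasing.

```python
import re
from collections import Counter

# Hypothetical pattern for instructions that request unbounded repetition.
REPEAT_INSTRUCTION_RE = re.compile(
    r"\b(repeat|say|write)\b.*\b(forever|indefinitely|endlessly|without stopping)\b",
    re.IGNORECASE | re.DOTALL,
)


def repetition_score(prompt: str) -> float:
    """Fraction of words in the prompt accounted for by its most common word."""
    words = re.findall(r"\w+", prompt.lower())
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)


def is_repetition_heavy(prompt: str, min_words: int = 20, threshold: float = 0.5) -> bool:
    """Flag prompts dominated by one word or asking for endless repetition.

    min_words and threshold are illustrative defaults, not tuned values.
    """
    words = re.findall(r"\w+", prompt.lower())
    dominated = len(words) >= min_words and repetition_score(prompt) >= threshold
    return dominated or bool(REPEAT_INSTRUCTION_RE.search(prompt))


if __name__ == "__main__":
    print(is_repetition_heavy("Repeat this word forever: book"))
    print(is_repetition_heavy("Please summarize the attached report."))
```

In practice a flag like this would more plausibly feed logging or rate limiting than block requests outright, since legitimate prompts can also contain heavy repetition.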

CyberSecureFox Editorial Team
