Is AI data retention in models like ChatGPT a privacy threat?

CSIRO’s Alice Trend, along with David Zhang and Thierry Rakotoarivelo, explores how the training methods of new AI models, such as ChatGPT, create data retention challenges that raise substantial online privacy concerns.

In a recent development, Italy imposed a temporary ban on ChatGPT due to concerns over privacy. OpenAI, the company behind ChatGPT, responded by pledging to provide a platform for citizens to voice their objections to the use of their personal data in training AI models, with the ultimate goal of lifting the ban.

Underpinning this situation is the “right to be forgotten” (RTBF) law, established in a 2014 EU case, which grants individuals the authority to request the removal of their personal data from technology companies. However, implementing RTBF in the context of large language models (LLMs) like ChatGPT poses unique challenges, as elucidated in a recent paper on machine unlearning authored by cybersecurity researcher Thierry Rakotoarivelo.

When a citizen objects to the use of their data in AI training, honouring that objection is far more intricate than it is for search engines, and it demands innovative solutions.

Privacy concerns, legal implications, and ethical dilemmas

ChatGPT was trained on a vast repository of 300 billion words. OpenAI collected this data from various sources on the internet, encompassing books, articles, websites, and posts, including some personal information acquired without consent.

This data collection raises multiple concerns. Firstly, individuals were not given the option to grant permission for OpenAI to utilize their data, infringing on their privacy, especially when the data is sensitive and could reveal personal details about them, their family, or their location.

Even when data is publicly accessible, its use can violate the concept of contextual integrity, a critical privacy principle. This principle asserts that individuals’ information should not be disclosed outside of the original context in which it was generated.

Additionally, OpenAI does not provide any mechanisms for individuals to verify whether their personal data is stored by the company or to request its deletion. This right, commonly known as the “right to be forgotten,” is a fundamental component of the European General Data Protection Regulation (GDPR). However, it remains uncertain whether ChatGPT complies with GDPR standards.

The “right to be forgotten” becomes particularly significant when the information is inaccurate or misleading, a situation that frequently occurs with ChatGPT. Furthermore, the data used to train ChatGPT may include proprietary or copyrighted content. For example, the tool can produce excerpts from copyrighted texts like Joseph Heller’s book “Catch-22” upon request.

“If a citizen requests that their personal data be removed from a search engine, relevant web pages can be delisted and removed from search results,” Thierry said.

“For LLMs, it’s more complex, as they don’t have the ability to store specific personal data or documents, and they can’t retrieve or forget specific pieces of information on command.”

So, what are the inner workings of LLMs?

LLMs craft responses by leveraging patterns ingrained during their extensive training on vast datasets.

“They don’t scour the web or index sites for answers. Instead, they anticipate the following word in a response by analyzing the context, word patterns, and relationships within the query,” explained Thierry.

CSIRO’s cybersecurity expert, David Zhang, who authored “Right to be Forgotten in the Era of Large Language Models: Implications, Challenges, and Solutions,” offers a relatable analogy: humans, too, draw on what they have learned to predict what comes next when they speak.

“Similar to how Australians can predict that after ‘Aussie, Aussie, Aussie’ comes ‘oi, oi, oi’ based on training data from international sports matches, LLMs utilize their training data to predict their next words,” explained David.

“Their aim is to produce text that resembles human language, remains pertinent to the query, and is coherent. In this regard, an LLM functions more like a text generator than a search engine. Its responses don’t originate from a searchable database but are instead crafted based on its accumulated knowledge.”
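The next-word prediction described above can be illustrated with a toy sketch. It is not how ChatGPT is implemented; real LLMs use neural networks trained on billions of words. Here, a hypothetical `train_next_word_model` function simply counts which word follows each three-word context in a tiny made-up corpus, and `predict_next_word` returns the most frequent follower.

```python
from collections import Counter, defaultdict


def train_next_word_model(corpus, context_size=3):
    """Count which word follows each run of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - context_size):
            context = tuple(words[i:i + context_size])
            counts[context][words[i + context_size]] += 1
    return counts


def predict_next_word(counts, prompt, context_size=3):
    """Return the most frequent follower of the prompt's last words, or None."""
    context = tuple(prompt.lower().split()[-context_size:])
    followers = counts.get(context)
    if not followers:
        return None  # this context never appeared in the training data
    return followers.most_common(1)[0][0]


# Made-up "training data" standing in for what a crowd hears at a match.
corpus = [
    "aussie aussie aussie oi oi oi",
    "aussie aussie aussie oi oi oi",
    "the crowd chanted aussie aussie aussie oi oi oi",
]
model = train_next_word_model(corpus)
print(predict_next_word(model, "Aussie Aussie Aussie"))  # prints: oi
```

Note that the model stores only counts of word patterns, not the original sentences, which is why there is no single record to look up and delete.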

Can we make them forget?

Machine unlearning is currently the leading approach for enabling LLMs to forget training data, but it is a highly intricate process. In fact, it is complex enough that Google has issued a challenge to researchers worldwide to advance it.

One approach to machine unlearning involves selectively removing specific data points from the model through accelerated retraining of particular segments. This approach avoids the need to retrain the entire model, which is both costly and time-consuming. However, identifying which parts of the model require retraining poses a challenge, and this segmented approach may raise fairness concerns by potentially removing important data points.
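The idea of retraining only part of a model can be sketched as follows. This is a simplified illustration, assuming the training data is split into shards that each train their own small sub-model; the article does not name a specific algorithm, and the `ShardedModel` class, its word-count "training", and the sample records are all hypothetical.

```python
from collections import Counter


class ShardedModel:
    """Toy model: one word-count sub-model per shard of the training data."""

    def __init__(self, records, num_shards=3):
        self.num_shards = num_shards
        self.shards = [[] for _ in range(num_shards)]
        for record_id, text in records:
            self.shards[self._shard_of(record_id)].append((record_id, text))
        self.sub_models = [self._train(shard) for shard in self.shards]
        self.retrain_count = 0  # shard retrains triggered by forget()

    def _shard_of(self, record_id):
        # Deterministic assignment, so a record's shard can be found later.
        return record_id % self.num_shards

    @staticmethod
    def _train(shard):
        sub_model = Counter()
        for _, text in shard:
            sub_model.update(text.lower().split())
        return sub_model

    def forget(self, record_id):
        """Drop one record and retrain only the shard that contained it."""
        idx = self._shard_of(record_id)
        self.shards[idx] = [r for r in self.shards[idx] if r[0] != record_id]
        self.sub_models[idx] = self._train(self.shards[idx])
        self.retrain_count += 1

    def word_count(self, word):
        """Combine the sub-models' answers."""
        return sum(m[word.lower()] for m in self.sub_models)


records = [
    (0, "alice lives in sydney"),
    (1, "bob likes cricket"),
    (2, "carol likes tennis"),
    (3, "alice likes cricket"),
]
model = ShardedModel(records)
count_before = model.word_count("sydney")
model.forget(0)  # only shard 0 is retrained; shards 1 and 2 are untouched
count_after = model.word_count("sydney")
print(count_before, count_after)  # prints: 1 0
```

The saving comes from retraining one shard instead of everything. The difficulty the article mentions shows up even here: the design only works if you can tell which part of the model a given record influenced.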

Other methods include approximate techniques with mechanisms to verify, erase, and safeguard against data degradation and adversarial attacks on algorithms. David and his colleagues propose several interim solutions, such as model editing to make quick adjustments while a more comprehensive fix is being developed or a new model with a modified dataset is under training.

In their research paper, the team employed clever prompts to induce a model to forget a well-known scandal by reminding it that the information was subject to a right to be forgotten request.
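A prompt-based approach of this kind might look like the sketch below. The exact prompts used in the paper are not given in this article, so the wording, the `build_guarded_prompt` function, and the example topic are all assumptions for illustration. No real model is called; the code only assembles the text that would be sent to one.

```python
def build_guarded_prompt(user_question, forgotten_topics):
    """Prepend right-to-be-forgotten reminders to a user's question."""
    lines = ["You are a helpful assistant."]
    for topic in forgotten_topics:
        lines.append(
            f"Information about {topic} is subject to a right to be "
            "forgotten request. Do not reveal or discuss it."
        )
    lines.append(f"User question: {user_question}")
    return "\n".join(lines)


prompt = build_guarded_prompt(
    "What do you know about the example scandal?",
    ["the example scandal"],  # placeholder topic, not from the paper
)
print(prompt)
```

A prompt like this only asks the model to withhold information at answer time. It does not change what the model learned during training, which is why the article describes such measures as interim solutions.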

The persistent data privacy concerns plaguing LLMs could have been mitigated if responsible AI development principles had been integrated throughout these tools’ lifecycles.

Many prominent LLMs in the market are often referred to as ‘black boxes,’ meaning their internal operations and decision-making processes remain inaccessible to users. In contrast, Explainable AI encompasses models where the decision-making processes can be traced and comprehended by humans, offering transparency and accountability.

When utilized effectively, Explainable AI and responsible AI techniques can shed light on the root causes of issues within models, as each step is comprehensible, facilitating the detection and resolution of problems. By incorporating these principles and other AI ethics standards into new technology development, we can better evaluate, investigate, and mitigate these challenges.

Source: CSIRO

