Anonymization is the only method for removing data from the scope of the GDPR. Unlike pseudonymization, however, it must be irreversible. Unfortunately, this notion of irreversibility is a frequent source of confusion between anonymization and pseudonymization, and can expose data controllers to significant penalties. On this subject, the G29 (the Article 29 Working Party, which brought together the European data protection authorities) published an opinion on anonymization techniques in 2014, in which it defined the notion of irreversibility.

Data anonymization is required whenever personal data would otherwise be processed in ways the GDPR does not permit: retaining data beyond the legal retention period, using personal data in non-production environments for testing and acceptance, or processing data for a purpose different from the one for which it was collected without a valid legal basis (e.g., using banking data for marketing without the individuals' consent). All such processing requires the implementation of a compliant anonymization process.

However, the compliance of an anonymization process rests essentially on its irreversible nature: it must not be possible to re-identify data subjects from the anonymized data. Irreversibility is also what distinguishes anonymization from pseudonymization, which is reversible. Yet the irreversibility criterion is a source of confusion and is very often misinterpreted. This exposes the data controller to the significant financial penalties provided for by the GDPR (up to 4% of the company's worldwide turnover or 20 million euros, whichever is higher), and also to reputational risk: poorly anonymized data can lead to the re-identification of data subjects, and the resulting damage to the company's image is difficult to repair.

1. Errors concerning the irreversibility of anonymization
 a. Thinking that deleting "identifying" data guarantees irreversibility

 The main error concerning irreversibility is to think that deleting "identifying" data is enough to guarantee it. This error stems in part from a misinterpretation of the definition of pseudonymization (Article 4 of the GDPR):

Pseudonymization (GDPR): "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."

The definition of pseudonymization specifies that, to protect data subjects, the additional information (the "identifying" data) must be stored separately and subjected to technical and organizational measures that guarantee its security. This "identifying" information is most often data such as surname, first name, social security number or bank card number.

The mistake is to consider that simply deleting the separately stored "identifying" data makes the pseudonymized data anonymous. In reality, even if the "identifying" data is deleted, the remaining "non-identifying" data may still contain information that can be used to re-identify the data subjects. Such "non-identifying" data are known as quasi-identifiers: although they do not directly identify a person, their combination with other data of the same type can lead to the complete identification of the persons concerned. For example, in 1997 in the USA [1], data "anonymized" (by deleting the "identifying" fields) and published by an insurance agency allowed the persons concerned to be re-identified from information such as date of birth, gender and ZIP code, which are quasi-identifiers. These quasi-identifiers made it possible to re-identify the then governor of the state of Massachusetts, William Weld, and it was later estimated that the same combination uniquely identifies some 87% of the US population.
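To make the attack concrete, here is a minimal sketch of such a linkage attack in Python. All records, names and field values below are invented toy data, not the actual 1997 datasets:

```python
# Linkage attack sketch: a "de-identified" table that still contains
# quasi-identifiers is joined with a public dataset (here, a voter roll)
# that lists names next to the same quasi-identifiers.
# All records and field names are invented toy data.

medical = [  # "identifying" fields deleted, quasi-identifiers kept
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-03-12", "sex": "F", "diagnosis": "asthma"},
]

voters = [  # public register: names next to the same quasi-identifiers
    {"name": "J. Smith", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "A. Jones", "zip": "02139", "dob": "1962-03-12", "sex": "F"},
]

QUASI_IDS = ("zip", "dob", "sex")

def link(anon_rows, public_rows, keys=QUASI_IDS):
    """Yield (name, anonymized row) for every row whose quasi-identifier
    combination matches exactly one person in the public dataset."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    for row in anon_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # unique combination => re-identification
            yield names[0], row

for name, record in link(medical, voters):
    print(f"{name} -> {record['diagnosis']}")
```

No "identifying" field survives in the medical table, yet every record is re-identified: the quasi-identifiers do all the work.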

Thus, deleting "identifying" data is not enough to protect individuals: the remaining data can still be re-identified, is therefore at best pseudonymized, and remains within the scope of the GDPR.

 b. Relying on unvetted anonymization methods to guarantee irreversibility

Another mistake is to rely on unvetted, sometimes "home-made" anonymization methods to guarantee irreversibility. History records several cases of poor anonymization based on such methods, which led to scandals once the people concerned had been re-identified. This was the case of the American Internet services company AOL [3], which in 2006 published "anonymous" data on search queries made by its users. This data was subsequently used to re-identify user "4417749" from queries such as "dog that urinates on everything", "60 single men" and "landscapes in Lilburn, GA": user "4417749" turned out to be Thelma Arnold, a widow with three dogs living in Lilburn, a town in Gwinnett County, Georgia. A more recent example of poor ("home-made") anonymization concerns the New York City Taxi and Limousine Commission [4], which published "anonymized" trip data from which it was possible to re-identify several movie stars (Bradley Cooper and Jessica Alba), reconstruct their journeys through New York City precisely, and even discover that they had left no tip. Worse still, the data was used to re-identify customers of a strip club on the outskirts of the city, tracing their route from the club to their home. All these attacks exploited flaws in the unvetted methods used to anonymize the data.
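The taxi case is instructive: the release is widely reported to have "anonymized" medallion numbers with an unsalted MD5 hash, yet valid medallion numbers follow a small, known format, so the entire space can be enumerated and hashed once. A minimal sketch, with the medallion pattern simplified for illustration:

```python
# Brute-forcing an unsalted hash over a small identifier space.
# NYC medallion numbers follow a few short, known patterns (simplified
# here to a "digit + letter + two digits" form, e.g. "5X55"), so every
# possible value can be hashed once into a reverse lookup table.
import hashlib
from itertools import product
from string import ascii_uppercase, digits

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

# Enumerate the (simplified) medallion space: 10 * 26 * 10 * 10 = 26,000 values.
rainbow = {
    md5_hex(f"{d1}{letter}{d2}{d3}"): f"{d1}{letter}{d2}{d3}"
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits)
}

# Any hash taken from the "anonymized" release can now be reversed instantly.
leaked_hash = md5_hex("5X55")  # stand-in for a value from the dataset
print(rainbow[leaked_hash])    # -> "5X55"
```

The lesson is that hashing a small, structured identifier space is not anonymization: anyone can rebuild the lookup table in seconds.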

To reduce the risk of re-identification, the G29 [2] recommends a set of techniques that serve as the benchmark for anonymization. They fall into two main families: randomization and generalization. Randomization covers techniques that alter the veracity of the data in order to weaken the link between the data and the individual, while generalization covers techniques that dilute the attributes of the persons concerned by changing their scale or order of magnitude (for example, a region rather than a city, a month rather than a week).
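As an illustration of the randomization model, here is a minimal noise-addition sketch; the attribute and the noise scale are arbitrary choices for the example, not a calibrated mechanism (a generalization sketch follows in Section 2):

```python
# Randomization sketch: perturb a numeric attribute with random noise so
# that individual values are no longer exact, while aggregate statistics
# remain approximately usable. The noise scale is arbitrary here; a real
# deployment must calibrate it against the re-identification risk.
import random

ages = [27, 34, 27, 51, 43, 38]

def add_noise(values, scale=3):
    return [v + random.randint(-scale, scale) for v in values]

noisy = add_noise(ages)
print(noisy)                                           # e.g. [25, 36, 29, 49, 44, 37]
print(sum(ages) / len(ages), sum(noisy) / len(noisy))  # means stay close
```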

2. What does the G29 consider irreversible?

Irreversibility must cover both direct and indirect identification, as stated in the definition of anonymization:

Anonymization (ISO 29100): "the process by which personally identifiable information is irreversibly altered, so that the person to whom the information relates can no longer be identified directly or indirectly."

Direct identification refers to identifiers such as name, social security number, e-mail address or credit card number; indirect identification refers to quasi-identifiers such as gender, age or postal code, but also the breed of a dog, the color of a shirt, or the keywords of a search query.

Indeed, any information can serve as a quasi-identifier, since the notion essentially depends on the context defined by the data in question (e.g., the keywords "dog that urinates on everything" were used to re-identify Thelma Arnold, cf. Section 1.b). Moreover, not all quasi-identifiers are equivalent: some have a higher degree of identification than others, and this too depends on the context of the data. For example, in a dataset containing people's age and gender, if there is only one "male" person aged "27", then this combination is a quasi-identifier which, in this particular dataset, has the highest possible degree of identification. In a different dataset containing several "male" persons aged "27", the same combination is still a quasi-identifier, but with a lower degree of identification.
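This "degree of identification" can be made concrete by counting, for each quasi-identifier combination, how many records share it (the equivalence class size on which the k-anonymity model is built). A minimal sketch over toy data:

```python
# Degree of identification sketch: group records by their quasi-identifier
# combination; a group of size 1 singles out exactly one person.
from collections import Counter

records = [
    {"sex": "M", "age": 27},  # unique combination -> fully identifying here
    {"sex": "F", "age": 27},
    {"sex": "F", "age": 27},  # ("F", 27) appears twice -> weaker identifier
]

groups = Counter((r["sex"], r["age"]) for r in records)
for combo, size in groups.items():
    risk = "singles out one person" if size == 1 else f"hides among {size}"
    print(combo, "->", risk)
```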

It was the need to take the context of the data into account, along with the risks of indirect identification via quasi-identifiers, that led the G29 to publish its 2014 opinion on anonymization techniques, which describes the principles to be respected. The G29 thus defines three risks when it comes to anonymization:

  • Individualization: the possibility of isolating some or all of the records identifying an individual in the dataset.
  • Correlation: the ability to link at least two records relating to the same data subject or group of data subjects, either within the same database or across two different databases. If an attack makes it possible to establish (for example, through correlation analysis) that two records correspond to the same group of individuals, but not to isolate individuals within this group, the technique resists individualization but not correlation.
  • Inference: the ability to deduce, with a high degree of probability, the value of an attribute from the values of a set of other attributes (see the sketch after this list).
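As an illustration, here is a minimal sketch checking two of these risks on a toy release: individualization as quasi-identifier groups of size one, and inference as groups whose sensitive attribute is homogeneous. Testing correlation would require a second dataset, as in the linkage sketch of Section 1.a:

```python
# Checking two of the G29 risks on a toy release.
# - Individualization: a quasi-identifier combination shared by one record.
# - Inference: a group whose sensitive attribute takes a single value, so
#   membership alone reveals that value.
# (Correlation would require linking against a second dataset, as in the
# linkage-attack sketch of Section 1.a, and is omitted here.)
from collections import defaultdict

rows = [
    {"zip": "75***", "age_band": "20-30", "diagnosis": "flu"},
    {"zip": "75***", "age_band": "20-30", "diagnosis": "flu"},
    {"zip": "69***", "age_band": "40-50", "diagnosis": "diabetes"},
]

groups = defaultdict(list)
for r in rows:
    groups[(r["zip"], r["age_band"])].append(r["diagnosis"])

for combo, diagnoses in groups.items():
    if len(diagnoses) == 1:
        print(combo, "-> individualization risk (unique record)")
    if len(set(diagnoses)) == 1:
        print(combo, "-> inference risk (homogeneous sensitive value)")
```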

Thus, defining a compliant, and therefore irreversible, anonymization method means reducing the risks of individualization, correlation and inference to an acceptable level. Irreversible anonymization is therefore not an absolute notion but a question of risk management: there is no such thing as zero risk, and there always remains a non-zero probability that a person can be re-identified. As in other areas of security, the aim is to reduce the risk of attack to an acceptable level.

Moreover, it would be pointless to aim for completely anonymous data. Anonymization transforms data by reducing the amount of information it contains; data that is completely anonymized may become useless. Anonymization therefore involves finding the best compromise between protecting individuals and preserving the usability of the data. For example, applying the generalization model (see Section 1.b), we could use a département rather than an arrondissement to define a locality, i.e. "75***" (Paris) instead of "75015" (15th arrondissement). This transformation reduces the precision of the data to protect the persons concerned, but we must ensure that it preserves the usability of the data, i.e. that using the département instead of the arrondissement does not distort the results of the analysis.
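The département example translates directly into a simple generalization function. In this minimal sketch, the utility check is just a count per département, standing in for whatever analysis the published data must still support:

```python
# Generalization sketch for the postal-code example: keep the département
# (first two digits) and mask the rest, then verify that an intended
# aggregate analysis still gives usable results.
from collections import Counter

postal_codes = ["75015", "75011", "75015", "69003", "13001"]

def generalize(code: str) -> str:
    return code[:2] + "***"   # "75015" -> "75***" (département level)

generalized = [generalize(c) for c in postal_codes]
print(Counter(generalized))   # Counter({'75***': 3, '69***': 1, '13***': 1})

# Utility check: counts per département must still support the analysis the
# data is released for; if arrondissement-level detail were needed, this
# generalization would be too coarse.
```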

3. Conclusion

The notion of irreversible anonymization is therefore a question of assessing the risks defined by the G29 (individualization, correlation and inference). To produce data that is anonymous, and therefore irreversible, these three risks must be reduced to an acceptable level. Consequently, the irreversibility of anonymization is not an absolute notion but a relative one, which is also desirable: completely anonymous data would be of no use to data controllers.

[1]: Barth-Jones, Daniel (July 2012). "The 're-identification' of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now."

[2]: Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques, adopted on April 10, 2014: https://www.dataprotection.ro/servlet/ViewDocument?id=1288

[3]: Barbaro, M., Zeller, T., & Hansell, S. (2006). "A Face Is Exposed for AOL Searcher No. 4417749." The New York Times.

[4]: Narayanan, Arvind, & Felten, Edward W. (2014). "No Silver Bullet: De-identification Still Doesn't Work." White paper: 1-8.