Anonymization: common mistakes and recommended measures
Anonymization is the only treatment that allows data to fall outside the scope of the GDPR. However, it is a source of many errors and misunderstandings, mainly because of a lack of awareness of re-identification risks, but also because of unfamiliarity with the regulatory requirements that govern this processing of personal data.
Anonymization takes data outside the scope of the GDPR. This means that none of the principles of the GDPR apply to anonymized data, giving companies greater flexibility to develop innovative services and improve existing ones. Certain processing operations are particularly well suited to this, including the development of artificial intelligence models, the use of production data for application testing, data retention beyond the legal retention period, healthcare research and open data.
However, to benefit from all these advantages, anonymization must be carried out in a compliant manner. So what does "compliant anonymization" mean?
To answer this question, the Article 29 Working Party (G29), since replaced by the European Data Protection Board (EDPB), which brings together all European data protection authorities (including the CNIL), published an opinion on anonymization techniques in 2014. In this opinion, the G29 identifies several criteria, both technical and regulatory, that must be met in order to claim compliant anonymization. These criteria aim to ensure that the anonymization process sufficiently reduces the risks of re-identification.
However, although the CNIL considers this opinion to be the main reference on anonymization, it is still little known by most organizations handling personal data. What's more, those who are aware of it don't necessarily have the expertise required to implement the technical measures it recommends. In fact, the opinion specifies that for an anonymization method to be compliant, it must withstand three technical re-identification risks: individualization, correlation and inference.
These limitations are the main source of errors and confusion when it comes to anonymization.
1. Anonymization errors
Anonymization errors can be broken down into 3 categories: regulatory, organizational and technical.
a. Regulatory framework
- Not treating anonymization as personal data processing: anonymization is itself a processing of personal data. Its output is anonymized and falls outside the scope of the GDPR, but its input is personal data governed by the GDPR. Moreover, if the anonymization requires a DPIA (e.g. use of health data combined with innovative technologies), failing to carry one out can lead to substantial penalties under the GDPR (up to 10 million euros or 2% of turnover).
- Relying on an inappropriate legal basis: as anonymization is a processing operation, it requires an appropriate legal basis (e.g. consent, legitimate interest, legal obligation). One of the most commonly used legal bases for anonymization is "legitimate interest". However, this legal basis requires a properly conducted balancing of interests, which is unfortunately often biased or missing.
- Failing to inform individuals: individuals must be informed, even when the processing is aimed at anonymization. It is therefore necessary to plan ahead and inform people at the time of collection, as it may be too late afterwards (e.g. informing people whose data already sits in a Data Lake).
- Thinking it's necessary to delete the original data: contrary to popular belief, it's not necessary to delete the original data that served as the source for anonymization. If this data is used as part of GDPR-compliant processing, it can be retained.
b. Organizational aspects
- Not anticipating anonymization actions: anticipation matters most when anonymization is used to comply with retention periods or the right to be forgotten. These operations must run on an ongoing basis, according to a predefined timetable. If the anonymization process is not planned in advance, it will be difficult to remedy the situation afterwards.
- Failing to include specific contractual clauses: specific contractual clauses must be defined, particularly for data exchanges with external service providers. Among other things, they should include a commitment by service providers and subcontractors not to attempt to re-identify data subjects.
- Not centralizing the anonymization process: the anonymization process must be centralized, in particular to guarantee consistent anonymization across different environments and use cases, and to control data purging and security.
- Not including additional security measures: anonymization aims to reduce the risk of re-identification, and it is always useful to add further security measures (e.g. access control, data encryption). This is particularly important when the residual risks identified by the re-identification risk analysis remain significant.
c. Technical aspects
- Thinking that an anonymization method is suitable for all use cases: anonymization is generally carried out for one or more specific use cases, and is not suitable for all use cases. It is therefore advisable to carry out a usage analysis to ensure that the data produced remains useful for the target purpose.
- Confusing anonymization with pseudonymization: this is by far the most common mistake. It stems mainly from a misinterpretation of the term "irreversible". Unlike pseudonymization, anonymization must be irreversible: it must not be possible to recover the original data from the anonymized data. In practice, however, methods such as hashing or encryption are wrongly used as anonymization methods, on the grounds that the original values are hard to recover from the transformed ones. This is a mistake. Although these methods are, to a certain extent, irreversible, they act on only part of the data and therefore do not make the dataset as a whole irreversible: the rest of the data can still be re-identified by individualization, correlation and/or inference. The CNIL also points out that hashing and encryption are pseudonymization methods, not anonymization methods.
- Thinking that combining several techniques necessarily produces anonymized data: this is false. To better understand this, let's look at the 2 tables below, which represent an original dataset (top table) and its so-called anonymized version (bottom table). Several techniques have been used to anonymize the data (as can be seen from the diagram): hashing, deletion, masking, rounding, etc. However, it is still possible to identify Mr. K. Mitnick in the anonymized data, i.e. to find out what illness he suffers from. If you look at the transformed e-mail addresses, you'll see that they are all different, which makes it possible to individualize every person in the table, and Mr. Mitnick in particular. An attacker who has Mr. Mitnick's e-mail address, and who knows that he took part in the study, can therefore easily discover that he has HIV.
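The weakness of transformed identifiers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are invented, the transformation is assumed to be an unsalted SHA-256 hash, and the helper names (`sha256_hex`, `distinct_ratio`, `lookup`) are hypothetical. It shows that a column of hashed e-mail addresses still holds one distinct value per person, and that an attacker who knows an address can recompute its hash and read the matching row.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Unsalted SHA-256 of a normalized e-mail address (assumed transformation)."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical records, invented for illustration only.
original = [
    {"email": "alice@example.org", "diagnosis": "asthma"},
    {"email": "bob@example.org", "diagnosis": "diabetes"},
    {"email": "carol@example.org", "diagnosis": "hypertension"},
]

# "Anonymized" release: the e-mail is replaced by its hash, the diagnosis is kept.
released = [
    {"email_hash": sha256_hex(r["email"]), "diagnosis": r["diagnosis"]}
    for r in original
]


def distinct_ratio(rows, column):
    """Share of distinct values in a column; 1.0 means every row is unique."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)


def lookup(rows, known_email):
    """Attack sketch: hash a known address and return the matching diagnoses."""
    target = sha256_hex(known_email)
    return [row["diagnosis"] for row in rows if row["email_hash"] == target]


print(distinct_ratio(released, "email_hash"))  # 1.0: each person is individualized
print(lookup(released, "bob@example.org"))     # ['diabetes']
```

A salted or keyed hash would block the direct lookup for an attacker without the secret, but each row would still carry a unique value, so the individualization risk would remain.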
2. Recommended measures
In order to guarantee compliant anonymization, the CNIL recommends a risk analysis based on the 3 re-identification criteria:
- Individualization (G29 Opinion, "singling out" in the English text): the possibility of isolating some or all of the records identifying an individual in the data set;
- Correlation (G29 Opinion, "linkability" in the English text): the ability to link together at least two records relating to the same data subject or to a group of data subjects (either in the same database, or in two different databases). If an attack makes it possible to establish (for example, by means of correlation analysis) that two records correspond to the same group of individuals, but does not make it possible to isolate individuals within this group, the technique is resistant to individualization, but not to correlation;
- Inference (G29 Opinion): the possibility of deducing, with a high degree of probability, the value of an attribute from the values of a set of other attributes.
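As a rough illustration of how these three criteria can be checked on a table, here is a minimal Python sketch. It is not a method prescribed by the G29 or the CNIL: the records, the choice of `zip` and `birth_year` as quasi-identifiers, and the function names are all assumptions made for the example. Each function flags one risk in its simplest form.

```python
from collections import Counter, defaultdict

# Hypothetical released records, invented for illustration only.
rows = [
    {"zip": "75001", "birth_year": 1980, "diagnosis": "flu"},
    {"zip": "75001", "birth_year": 1980, "diagnosis": "flu"},
    {"zip": "75002", "birth_year": 1975, "diagnosis": "asthma"},
    {"zip": "75002", "birth_year": 1975, "diagnosis": "diabetes"},
    {"zip": "69003", "birth_year": 1990, "diagnosis": "asthma"},
]
# A second hypothetical dataset held by an attacker.
other = [{"zip": "69003", "birth_year": 1990, "customer_id": "C-17"}]

QUASI_IDENTIFIERS = ("zip", "birth_year")  # assumed quasi-identifiers


def group_key(row, columns):
    return tuple(row[c] for c in columns)


def singled_out(rows, columns):
    """Individualization: rows whose quasi-identifier combination is unique."""
    counts = Counter(group_key(r, columns) for r in rows)
    return [r for r in rows if counts[group_key(r, columns)] == 1]


def linkable_keys(rows_a, rows_b, columns):
    """Correlation: quasi-identifier combinations present in both datasets."""
    keys_a = {group_key(r, columns) for r in rows_a}
    keys_b = {group_key(r, columns) for r in rows_b}
    return keys_a & keys_b


def inferable_groups(rows, columns, sensitive):
    """Inference: groups of several rows that all share one sensitive value."""
    values, sizes = defaultdict(set), Counter()
    for r in rows:
        key = group_key(r, columns)
        values[key].add(r[sensitive])
        sizes[key] += 1
    return {k: next(iter(v)) for k, v in values.items()
            if sizes[k] > 1 and len(v) == 1}


print(singled_out(rows, QUASI_IDENTIFIERS))
print(linkable_keys(rows, other, QUASI_IDENTIFIERS))
print(inferable_groups(rows, QUASI_IDENTIFIERS, "diagnosis"))
```

In this toy table, the single record for zip 69003 is individualized and can be linked to the second dataset, while anyone known to be in the (75001, 1980) group can be inferred to have the flu even though no record in that group is unique.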
To address these three risks, the G29 has identified specific anonymization techniques, grouped into 2 families: generalization and randomization. Generalization aims to dilute the level of detail contained in a datum (e.g. using a zip code rather than an address, a year rather than a complete date), while randomization aims to alter the veracity of the data in such a way as to prevent re-identification (e.g. adding noise to values, permuting values).
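The two families can be sketched as follows in Python. This is an illustrative sketch only: the records, the field names and the parameters (a two-digit zip prefix, a noise spread of 2 kg) are assumptions chosen for the example, not values recommended by the G29.

```python
import random

# Hypothetical records, invented for illustration only.
records = [
    {"zip": "75001", "birth_date": "1980-04-17", "weight_kg": 70},
    {"zip": "75002", "birth_date": "1975-11-02", "weight_kg": 82},
    {"zip": "69003", "birth_date": "1990-06-30", "weight_kg": 64},
]


def generalize(row):
    """Generalization: keep a coarser zip area and only the year of birth."""
    return {
        "zip_area": row["zip"][:2],               # assumed granularity
        "birth_year": int(row["birth_date"][:4]),
        "weight_kg": row["weight_kg"],
    }


def add_noise(rows, column, spread, rng):
    """Randomization: shift each value by a random integer in [-spread, spread]."""
    return [{**r, column: r[column] + rng.randint(-spread, spread)} for r in rows]


def permute(rows, column, rng):
    """Randomization: shuffle one column so values no longer match their rows."""
    values = [r[column] for r in rows]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(rows, values)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
generalized = [generalize(r) for r in records]
noisy = add_noise(generalized, "weight_kg", 2, rng)
shuffled = permute(generalized, "weight_kg", rng)

print(generalized[0])
print(noisy)
print(shuffled)
```

Permutation keeps the overall distribution of the column intact, whereas noise addition changes individual values but keeps them close to the originals. Which trade-off is acceptable depends on the target use case, and neither guarantees anything until the re-identification risks have been reassessed on the result.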
As we saw in the previous example, it's not enough to use one or other of these models to "magically" guarantee compliant anonymization. In all cases, it is necessary to assess whether they effectively reduce the risks of re-identification, i.e. individualization, correlation and inference. If the residual risk after applying these techniques remains high, it will be necessary to implement additional measures to reduce this risk to an acceptable level. It is also advisable to seek expert advice in case of doubt.
3. Conclusion
Anonymization allows greater flexibility in the development of innovative services and the improvement of existing ones. However, it is a source of errors and misunderstandings on the part of organizations processing personal data; this is mainly due to a lack of knowledge of the texts governing anonymization, but also to the level of expertise required to carry out a successful anonymization process. To avoid such errors, the CNIL recommends carrying out a re-identification risk analysis based on the 3 re-identification criteria of individualization, correlation and inference. It is also recommended to call in an expert in case of doubt.