Whether communicating by email or videoconference, or working with customer or employee data, dealing with personal data is part of our everyday professional (and personal) lives. The fact that data protection must play a paramount role in companies and public authorities is nothing new. But how can data protection be guaranteed when the use of protected data is at the heart of a research project, and the project cannot be carried out without the information provided by the data? Anonymisation and pseudonymisation play a central role in meeting this challenge. They make it difficult or even impossible to (re)attribute data to a specific person. The particular difficulty here is to maintain the usability and informative value of the data while pseudonymising it. In this article, we look at how the supposed contradiction between data protection through pseudonymisation and the use of personal data can be dealt with in research practice. We also take a look at the particular challenges that stakeholders in the healthcare sector face in this regard.
What does anonymisation mean?
Strictly speaking, anonymisation is not mentioned at all in the legal text of the GDPR; only Recital 26 explicitly refers to “personal data rendered anonymous”. For a legal definition, we therefore need to refer to the Open Data Directive 2019/1024: “Anonymous information is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or is no longer identifiable.” The decisive factor is that it is no longer possible to identify a person, i.e. the specific person to whom the information relates can no longer be determined. In this respect, data protection law brings its protection forward to the moment when the identification of natural persons is merely possible, rather than waiting until it has actually taken place. Conversely, once data can no longer be attributed to a specific individual, the GDPR no longer applies, as there is no longer anyone to protect under data protection law.
This means that in practice the possibility of identifying the person should actually be excluded in order to avoid violations of the law. According to the European Court of Justice (ECJ, judgment of 19 October 2016, C-582/14), a person is already identifiable within the meaning of the GDPR if conclusions about their identity can be drawn even indirectly. In this case, the ECJ ruled that even dynamic IP addresses, i.e. those that do not contain direct information about the person accessing the website, can constitute personal data for a website operator. It is irrelevant that the additional information required to identify an individual is held not by the provider of the service itself but by the internet service provider – because the former has the possibility, under certain circumstances, to demand that the internet service provider link the data to the individual. This is precisely what is relevant under Recital 26 of the GDPR: “To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, (…) either by the controller or by another person to identify the natural person directly or indirectly.” Consequently, anonymisation does not require that identification be objectively impossible for everyone – rather, so-called “de facto anonymity” is sufficient. But the standards are high. According to the judgment, it is not necessary for a possible re-identification “that all the information enabling the identification of the data subject must be in the hands of one person”. Companies should therefore not be too quick to assume that data is anonymous and that data protection law does not apply.
How can data be anonymised?
The GDPR only sets out the framework conditions for anonymisation. In practice, this means that data usually needs to be technically modified to a greater extent than simply adjusting or removing the person’s name in a data set. For example, if a pharmaceutical company is developing a vaccine against a virus and is conducting a clinical trial on a few hundred people, this will generate a large amount of data on the subjects. This may include each person’s name, address, age, weight, information about pre-existing conditions, etc. Even if the company were to remove the name and address of the individual, as this information is not needed for the purpose of the study, it may still be possible to identify individuals without much effort, based on their pre-existing conditions and age, or medications they are taking, and so on. In such a case, there is no question of anonymisation.
There is no general rule as to which measures achieve complete anonymisation, but data sets should be modified in such a way that any possible combination of attributes in the data set matches at least two records, i.e. can be traced back to at least two different individuals (this is the idea behind so-called k-anonymity). It goes without saying that the higher the number of matches, the better and safer the result. The more specific the information, the more the data set will need to be modified. In addition to withholding or erasing data, in practice there are several other anonymisation techniques that controllers can use.
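The “at least two matches per attribute combination” requirement can be checked mechanically. The following is a minimal sketch (the function name, field names and the small sample data set are illustrative, not from the article) that tests whether every combination of quasi-identifying attributes occurs at least k times:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k=2):
    """Return True if every combination of quasi-identifier values
    occurs at least k times in the data set."""
    combos = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combos.values())

subjects = [
    {"age_group": "30-39", "postcode": "101", "diagnosis": "A"},
    {"age_group": "30-39", "postcode": "101", "diagnosis": "B"},
    {"age_group": "40-49", "postcode": "102", "diagnosis": "A"},
]

# The third record's combination (40-49, 102) is unique -> not 2-anonymous
print(is_k_anonymous(subjects, ["age_group", "postcode"], k=2))  # False
```

A real assessment would of course also have to consider which attributes count as quasi-identifiers in the first place, which is a legal as well as a technical question.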
The generalisation/coarsening method enlarges the scale of the data to prevent attribution to individuals. For example, subjects can be divided into age groups, which then replace their exact age. Again, care must be taken not to render the data useless by over-generalisation. By randomly swapping the values of individual columns between records in a table, attribute values can be reassigned to other records while the remaining columns stay unchanged. As this can cause statistical correlations to be lost, some of these methods are adapted so that only similar values are swapped – for example, swapping diagnoses only between people of the same sex does not change the statistical picture, but preserves the correlations between sex and disease.
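The generalisation step described above – replacing an exact age with an age band – can be sketched in a few lines (the function name and band width are illustrative assumptions):

```python
def generalise_age(age, width=10):
    """Replace an exact age with a coarse age band, e.g. 37 -> '30-39'.

    The wider the band, the stronger the anonymisation effect,
    but also the greater the loss of statistical precision.
    """
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

print(generalise_age(37))  # '30-39'
print(generalise_age(62))  # '60-69'
```

Choosing the band width is exactly the trade-off mentioned in the text: bands that are too narrow leave individuals identifiable, bands that are too wide render the data useless.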
So-called “noise” introduces fictitious measurement errors, which slightly perturb the data without changing the overall picture conveyed by the statistics. This can be done, for example, by shifting a person’s date of birth from 5 to 10 April. It is also possible to create completely new, artificial data to replace the original data, where the newly generated data set is based on a statistical model derived from the original data. Finally, the number of people represented in a data set can simply be reduced: individual rows are omitted entirely, or only a random sample of rows is retained. Again, it is important to preserve the statistical information as far as possible.
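The date-of-birth example can be sketched as a simple noise function (the function name and the ±7-day window are illustrative assumptions, not prescribed by any standard):

```python
import random
from datetime import date, timedelta

def add_date_noise(d, max_days=7, rng=None):
    """Shift a date by a random number of days in [-max_days, +max_days].

    Individual values are perturbed, but because the shifts are symmetric
    around zero, the overall distribution stays roughly unchanged.
    """
    rng = rng or random.Random()
    return d + timedelta(days=rng.randint(-max_days, max_days))

original = date(1980, 4, 5)
noisy = add_date_noise(original, max_days=7, rng=random.Random(42))
```

Note that naive noise alone rarely suffices for anonymisation: if an attacker knows the noise bounds, outliers may still be re-identifiable, which is why the text recommends combining several techniques.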
None of the above methods can completely eliminate the risk of re-identification. If an attacker gains access to other data that is combined with the anonymised data, they may be able to re-establish an individual’s identity. This is why it is often useful and advisable to combine several anonymisation techniques in order to achieve secure anonymisation.
Pseudonymisation and the attribution of additional knowledge
Unlike anonymisation, pseudonymisation merely requires that identity data be separated from the substantive information (Art. 4 No. 5 GDPR). This means that attribution of the pseudonymised data to an individual is still possible using the associated key, which is why the data must still be considered personal data in accordance with Recital 26, Sentence 2 of the GDPR. The “pseudonymisation key” has to be kept separately and under special protection. Pseudonymisation is thus itself one of the appropriate technical and organisational measures for protecting personal data, but the GDPR continues to apply to pseudonymised data: as long as it is possible to re-attribute the data to a specific individual, this EU regulation applies. This conclusion is unproblematic, at least where the pseudonymised data and the corresponding key are held (albeit separately) by the same controller.
The situation might be different if the data necessary for re-identification were not held by the controller but by a third party, and the former had only the pseudonymised data without any possibility of identification. If this additional knowledge of a third party is inaccessible to the controller, i.e. if there is no relationship between the two entities, this could result in de facto anonymisation. This is partly supported by the argument that it must be possible in practice to re-attribute the data to a specific individual in order to be able to assume that the data is personal in nature. In some cases, it is even required that, in addition to the mere practical possibility, there must also be a subjective intention on the part of the controller in order for the GDPR to apply. This broad interpretation of anonymisation is challenged by others who argue that there is always the possibility of attributing data to an individual person if additional knowledge is held by a third party, regardless of the circumstances under which and by whom it can be accessed. The ECJ’s position in this argument was already explained above – according to the case law, additional knowledge held by third parties is attributable (and the GDPR applicable) if access to that additional knowledge by the controller can reasonably be expected. Ultimately, ambiguity in this area cannot be completely eliminated.
In practice, pseudonymisation often uses hashing, which replaces certain values with character strings (hash values). Encryption techniques are also used, where a cryptographic algorithm creates an encrypted value from the plaintext. Alternatively, pseudonyms can be randomly generated and stored in mapping tables. While encryption allows the original value to be recalculated easily using the key, hashing is not easily reversible. In terms of data protection compliance, a comprehensive and technically reliable access and authorisation policy is also important here. There are different levels of pseudonymisation: both strong and weak pseudonymisation are possible. Strong pseudonymisation is particularly necessary when special categories of data within the meaning of Art. 9 GDPR are processed, such as data concerning health, or when the data is exposed to an increased risk.
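A common variant of the hashing approach uses a keyed hash (HMAC) rather than a plain hash, because a plain hash of a low-entropy identifier (such as a patient number) can be reversed by simply hashing all plausible inputs. A minimal sketch, with illustrative identifier and function names:

```python
import hashlib
import hmac
import secrets

# The pseudonymisation key must be generated once and then stored
# separately from the pseudonymised data, under special protection.
key = secrets.token_bytes(32)

def pseudonymise(value: str, key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256.

    Without access to the key, the mapping can neither be reversed
    nor rebuilt by brute-forcing candidate identifiers.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

p1 = pseudonymise("patient-4711", key)
p2 = pseudonymise("patient-4711", key)
assert p1 == p2  # same input, same pseudonym: records remain linkable
```

Because the same input always yields the same pseudonym under a given key, records belonging to one person stay linkable across data sets – while holding the key separately is precisely what Art. 4 No. 5 GDPR requires.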
Remember the legal basis!
According to the GDPR, every type of data processing requires a legal basis in order to be legally compliant. However, it is often overlooked that the anonymisation and pseudonymisation of data are themselves a form of data processing. Even if at first sight both instruments have an exclusively positive impact on data protection, it should be noted that the loss of data can also adversely affect the rights and freedoms of data subjects. In principle, all legal bases of Art. 6 of the GDPR come into consideration, in particular legitimate interests, performance of a contract, and consent. Special legal bases may also be used, such as Sect. 27 of the German Federal Data Protection Act (BDSG) for scientific research or statistical purposes.
What are the advantages and disadvantages?
The advantage of anonymisation is obvious: if data cannot be attributed to individuals, and the GDPR therefore does not apply, the controller is relieved of the burden of complying with data protection requirements. However, companies should carefully consider beforehand whether it is even possible to completely eliminate the possibility of attributing data to individuals, and how much effort is likely to be involved. Some even argue that “genuine” anonymisation is practically never possible, as with big data, for example, there is always the possibility of re-identification. According to a 2019 study carried out by research institutions including Imperial College London and the Université catholique de Louvain in Belgium, 99.98 % of US citizens could be re-identified in the data sets examined on the basis of just 15 characteristics, such as age or place of residence; in 80 % of cases, the three characteristics of gender, date of birth and postcode alone were sufficient for re-identification in supposedly anonymised data sets. So it is clear that this issue merits special attention. On the other hand, it is necessary to check whether the data set will still be usable after it has been anonymised.
The process of pseudonymisation, on the other hand, does not exempt the controller from complying with the requirements of the GDPR. Pseudonymisation may, however, work in favour of the controller’s legitimate interests in data processing: the stronger the pseudonymisation, the more likely it is that the interests of the company (or the controller) will prevail, as the data subjects are better protected under data protection law. In addition, the requirements for technical and organisational measures to protect data processing operations are reduced. Attacks on the protected data can then only be carried out by singling out, linking or inference. Singling out means isolating records that relate to a single individual within a data set. Linking means combining multiple data sets in order to identify correlations and match records. Inference means deriving personal information through logical deductions from a combination of certain data sets. If controllers are aware of these attack risks, it will be easier for them to choose the appropriate pseudonymisation procedure and thus to eliminate as many of the risks of re-identification as possible.
Special case: Anonymisation and pseudonymisation in the healthcare sector
Dealing with data, especially large amounts of it, is unavoidable in the healthcare sector – whether it is for treatment in hospital, or for handling CT scans or X-rays. More and more people are using health apps that collect data about their physical condition. Findings from clinical trials can also help to improve diagnostic and therapeutic options. All of these cases involve data concerning health, which is one of the specially protected categories of data pursuant to Art. 9 of the GDPR and the processing of which is subject to additional requirements.
The healthcare sector also poses particular challenges for anonymisation. Given the great potential of using data concerning health for research purposes, which was illustrated particularly clearly during the coronavirus pandemic, researchers often complain that data protection rules are an obstacle to the freedom of science and research. The appeal to clear the way for research with health-related data is directed in particular at the national government – because other EU countries, such as Finland, already make this possible in a much simpler way. However, in order to ensure that health-related data cannot be attributed to a specific person, advanced anonymisation methods are needed that can cope with the sheer diversity of such data – after all, medical studies analyse not only information in tables, but also image data (such as X-rays) or samples and voices.
This is not the only reason why it makes sense to use pseudonymisation methods in the healthcare sector. As it is possible to restore the possibility of attributing pseudonymised data to individuals (see above), this procedure is also suitable for informing study participants of new findings at a later stage. This can be essential, for example, to help those affected by a genetic disease to schedule important screening tests or learn about new treatment options. Especially in the case of rare diseases, it is hard to imagine real progress without data being stored in a database for a long time – and possibly later used for other purposes that were not foreseeable at the time the database was created, but which are useful for the patients concerned. Doctors regularly report that people with rare diseases, in particular, want to be kept informed of progress long after they have participated in a trial, as this is often their only hope of a cure. This complex situation ultimately means that the way in which data protection requirements are implemented will always depend on the nature and scope of the use of medical data. But it is clear that there are few other areas where innovation and progress depend so much on the timely translation of regulatory requirements into practical guidance for doctors and researchers.
Many practical and legal challenges need to be addressed when anonymising and pseudonymising data. Pseudonymisation is clearly more practical than anonymisation, where it is often not possible to be certain that the possibility of attribution to individuals has been completely and irrevocably removed, especially as anonymous data is often not as valuable for research as personal data. Nevertheless, both instruments are suitable for striking a balance between the data subject’s right to “informational self-determination” and the freedom of science and research. It is also clear that personal data must be made usable for research and science, and that this calls for practical solutions. A good example of this is the European Health Data Space Regulation proposed by the Commission, which provides for the cross-border exchange of health data within the EU.
It is therefore advisable to decide between the two procedures according to the purpose that the data must fulfil after the removal or dilution of the possibility of attribution to individuals. In some cases, it will be necessary to use pseudonymisation anyway, because the restoration of this possibility is required by law – for example, in the case of the obligation to document medical treatments under Sect. 630f of the German Civil Code (BGB). In other cases, it may simply be desired, or it may be in the interest of a medical trial, for example, where the data protection principle of purpose limitation collides with a long-term interest in acquiring knowledge. Furthermore, as absolute anonymisation is very difficult to achieve, it is advisable to assume the applicability of the GDPR in case of doubt. Although pseudonymisation cannot completely eliminate data protection measures, it is still a good tool to at least reduce the effort associated with them. In this way, it can at least facilitate the use of personal data for research and other innovative purposes. It is to be hoped that, in the foreseeable future, there will be consensus on the concrete requirements for anonymisation – and whether, in practice, relative anonymisation can be sufficient.