Data about human beings are very rarely anonymous. Even if the data do not contain information such as name or address, they may still pose privacy risks to the people they are about, especially if it is possible to re-identify individuals within the data you are using for your research. To manage these risks, it is important to understand how severe they are in your situation. The table below can be used to determine privacy risk categories for different types of data; however, we now recommend using the privacy risk app (which is based on the table below), because the app offers advice that is specific to your needs.

Privacy risks are based on three pillars: the vulnerability of the research subjects, the re-identifiability of the data, and the risk of harm to the research subjects posed by the information in the data. Neither the guidance on this page nor the privacy risk app can capture every possible situation, so think of this advice as a spectrum and, if in doubt, opt for a higher risk category. Also, not all of the data in your research will have the same level of privacy risk, and the privacy risks for each type of data may change as you clean, recode and modify the data from a raw to a processed form. You should assess the risks for each separate data asset and also consider how each data asset’s risk will change over the course of the research life cycle. If some data are higher risk than others, you can store them separately in a more secure storage option.

Once you’ve determined the privacy risk category for each data asset, you can use these categories to inform your choices about how your data can be de-identified, safely used by students/interns, safely transported physically, securely transferred digitally, and securely stored.
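
To make this concrete, below is a minimal sketch, in Python, of a per-asset risk register you might keep alongside your data management plan. The assets, risk categories and storage descriptions are hypothetical examples only; substitute the categories you determined with the table or the privacy risk app and the storage options available to you.

    # Hedged sketch of a per-asset privacy risk register for a hypothetical project.
    # The assets, risk categories and storage notes below are illustrative examples,
    # not prescriptions from the guidance above.
    risk_register = [
        {
            "asset": "video recordings of interviews",
            "raw_risk": "red",          # directly identifying, vulnerable subjects
            "processed_risk": "yellow", # after transcription and de-identification
            "storage": "restricted, high-security storage",
        },
        {
            "asset": "key file (names + participant IDs)",
            "raw_risk": "orange",
            "processed_risk": "orange", # remains identifying until it is deleted
            "storage": "stored separately from the research data",
        },
        {
            "asset": "de-identified questionnaire responses",
            "raw_risk": "green",
            "processed_risk": "green",
            "storage": "standard project storage",
        },
    ]

    # Print a simple overview of how each asset's risk changes over the life cycle.
    for item in risk_register:
        print(f"{item['asset']}: {item['raw_risk']} -> {item['processed_risk']} ({item['storage']})")

Even a lightweight overview like this makes it easier to spot which assets need more secure storage and when a risk category changes as the data move from raw to processed form.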



Privacy Risk Categorization

Red Data

Privacy Risk: Very high risk
Description: Directly identifying data from vulnerable people containing information that elevates the risks of harm
Impact of a Breach:
  • Severity of harm to research subjects and/or damage to the reputation of VU Amsterdam would be very high
  • The likelihood of harm or damage after a breach is very high
Example:
  • Video interviews with children talking about abuse

Orange Data

Privacy Risk: High risk
Description:
  • Directly identifying data from non-vulnerable people containing information that does NOT elevate the risks of harm
  OR
  • Directly identifying data from vulnerable people containing information that does NOT elevate the risks of harm
  OR
  • Directly identifying data from non-vulnerable people containing information that elevates the risks of harm
  OR
  • Data from vulnerable people containing information that elevates the risks of harm that have been made slightly less identifiable by removing easily identifying information (e.g. name, contact information)
Impact of a Breach:
  • Severity of harm to research subjects and/or damage to the reputation of VU Amsterdam would be high
  • The likelihood of harm or damage after a breach is moderate to high
Example:
  • Key files containing names and contact information of healthy adults

Yellow Data

Privacy Risk: Medium to high risk
Description:
  • Data from non-vulnerable people containing information that does NOT elevate the risks of harm that have been made slightly less identifiable by removing easily identifying information (e.g. name, contact information)
  OR
  • Data from vulnerable people and/or containing information that elevates the risks of harm that have undergone additional de-identification steps beyond the removal of easily identifying information (e.g. name, contact information)
Impact of a Breach:
  • Severity of harm to research subjects and/or damage to the reputation of VU Amsterdam would be moderate to high
  • The likelihood of harm or damage after a breach is moderate to low
Example:
  • IP and MAC addresses of healthy adults

Green Data

Privacy Risk: Medium to low risk
Description: Data from non-vulnerable people containing information that does NOT elevate the risks of harm that have undergone additional de-identification steps beyond the removal of easily identifying information (e.g. name, contact information)
Impact of a Breach:
  • Severity and likelihood of harm to research subjects after a breach are low
  • Damage to the reputation of VU Amsterdam is still possible, but the likelihood is lower and the impact would be less severe
Example:
  • Highly variable physical measurements (e.g. blood pressure, heart rate, blood glucose, body temperature) from healthy adults

Blue Data

Privacy Risk: Little to no risk
Description: Data that cannot be re-identified whatsoever, regardless of the vulnerability of the research subjects or the risks of harm posed by the information
Impact of a Breach:
  • Research subjects will suffer no direct** harm and VU Amsterdam will suffer no damage to its reputation
Examples:
  • Likert scale responses from healthy adults in a questionnaire about benign topics

NB: If your data can still be linked to identifying information (e.g. through participant identifiers that link to a separate key file), the data are not anonymous and therefore not “blue”. Such data would be “green” or “yellow” depending on the sensitivity of the information and vulnerability of the research subjects. If it is possible and appropriate to delete this last link to the identifying information, then the data can be considered anonymous.
**NB: Although research subjects will not be directly harmed, the conclusions drawn from research results or the misuse of published research software can impact the wider population to which the research subjects belong. Such ethical considerations should be discussed with the FGB Scientific and Ethical Review Board.


Confidentiality Risk versus Privacy Risk

Confidentiality and privacy overlap; however, confidentiality is about how a data breach would impact our institution, while privacy considers how a data breach would impact our research subjects. Confidentiality concerns have been taken into account in the guidance above, so the privacy risks you determine can also be viewed as confidentiality risks.

If your data are not about human subjects, they may still need to be kept confidential, for example, when working with business secrets or intellectual property. If you are working with a third party, especially a business, they may require you to keep their data confidential. Below are some examples of different confidentiality risks for non-human data:

Red Data: Very high confidentiality risk
  Example: Data that are classified as “secret”
Orange Data: High confidentiality risk
  Examples:
  • Commercially sensitive data
  • Politically sensitive data
  • Data subject to non-disclosure agreements
  • Patents & other intellectual property
  • AI algorithms that could benefit other countries
Yellow Data: Medium confidentiality risk
  Examples:
  • Unpublished research output with novel results
  • Internal procedures and policies
Blue Data*: Low confidentiality risk
  Example: Data that can be publicly disclosed
  * Green data aren’t listed here because the primary distinction between Green and Blue privacy risks is that Green data are still technically personal data, while Blue data are anonymous or anonymized. There is no comparable category to Green data when assessing the confidentiality risks of non-personal data, so low confidentiality risk data should be handled as Blue data. Be aware that although Blue data can be publicly disclosed, you should still assess, when publishing these data, whether any limitations should be placed on how they are reused. Even publicly available data can have limits placed on their reuse by applying restrictive licenses, such as those that don’t allow reuse for commercial purposes. More information on data licensing is found here.


Important Factors in Privacy Risk


The ease with which research subjects can be re-identified in the data:

  • Data are personal data if they are directly identifying or indirectly identifiable:

    • Directly identifying data are what most people think of as personal data: name, contact information, facial images, etc. This information isn’t always directly identifying (e.g. a name like Jan Smit is very common in The Netherlands, so that name alone might not be enough to identify someone). Regardless, it’s generally agreed that these types of data should be handled with extra care and ideally stored separately from other research data.
    • Indirectly identifiable data can also be referred to as pseudonymous data (although there are some caveats1 to this). The ease with which a research subject could be re-identified from indirectly identifiable data depends on several factors such as:
      • How much information has been collected about each research subject?
      • How specific is the information about each research subject?
      • How unique is the information about each research subject?
        • Unique information may be a result of one variable with extreme values or a combination of several variables that create a record that is distinct from all others in the dataset
      • Could the data be linked to publicly available information, such as social media profiles?
    • It is very important to consider how easily indirectly identifiable data can be re-identified. If only the most cursory of efforts has been made to de-identify the data (most commonly, only removing names and contact information from the dataset), then the research subjects usually remain at risk of re-identification due to the wealth of other information present in the dataset. If someone tells you a dataset is “pseudonymous”, it is also a good idea to assess how easily these data could be re-identified, since not everyone has the same idea about what pseudonymous means1.
  • As long as data are identifiable, they cannot be referred to as anonymous data and the legal rules of the GDPR must be followed. This means the GDPR applies to pseudonymous data.

  • Not all identifiable data pose the same level of privacy risk: raw data often pose higher privacy risks because they are more easily re-identifiable (e.g. video recordings) and therefore require additional data protection measures (such as high-security storage); as the data are cleaned, recoded and analysed, they become less identifiable (e.g. coded interactions) and require fewer data protection measures. De-identification is therefore an important part of data processing that can be used to protect the privacy of your research participants; the identifiability of the data is the only factor you can change, since it is not possible to reduce the vulnerability of your research subjects or the sensitivity of your research topic. The sketch below illustrates a simple check of how identifiable a “de-identified” dataset still is.
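
As a rough illustration of how data that have only had names and contact information removed can still single people out, the sketch below counts how many records are unique on a set of quasi-identifiers (a simplified, k-anonymity-style check). The file name and column names (age, postcode, occupation) are hypothetical; adapt them to your own dataset and your own choice of quasi-identifiers.

    # Hedged sketch: how many records can be singled out on a few quasi-identifiers?
    # Assumes a hypothetical CSV with columns "age", "postcode" and "occupation".
    import pandas as pd

    QUASI_IDENTIFIERS = ["age", "postcode", "occupation"]

    df = pd.read_csv("survey_responses.csv")

    # Count how often each combination of quasi-identifier values occurs.
    group_sizes = df.groupby(QUASI_IDENTIFIERS).size()

    # Records whose combination occurs only once are unique and easiest to re-identify.
    n_unique = int((group_sizes == 1).sum())

    print(f"{n_unique} of {len(df)} records are unique on {QUASI_IDENTIFIERS}")
    print(f"Smallest group size (k): {int(group_sizes.min())}")

If the smallest group size is 1, at least one research subject can be singled out on these variables alone, which suggests that further de-identification is needed before the data can be treated as lower risk.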


The vulnerability of the research subjects:

  • Vulnerable research subjects face an additional risk of harm (social, physical, emotional or financial) if their personal information is made public. The greater the vulnerability of the research subjects, the greater the potential for serious harm.

  • Vulnerable research subjects include, but are not limited to:

    • children
    • people who identify as LGBTQ+
    • refugees
    • ethnic or religious minorities
  • When assessing the vulnerability of the research subjects, bear in mind that what matters is which vulnerability characteristics are known or disclosed within the data. If you are not actively studying a specific vulnerable population and do not need data that could identify someone as vulnerable, do not collect that information; for example, if you do not need to know the ethnicity or sexuality of your research subjects, do not record it in your research data. That way you minimize the overall privacy risks by minimizing the vulnerabilities disclosed about the research subjects studied. However, there may be situations where collecting this information is unavoidable, e.g. video recordings of research subjects will also capture skin colour and any religious clothing. Keep such cases in mind: even if you are not actively studying a vulnerable population, consider whether you will collect data that could identify someone as part of a vulnerable population when assessing the risks of harm to the research subjects.


The risk of harm posed by the information being used:

  • The risk of harm posed by the information in the data is the most complex and subjective consideration in assessing privacy risks. Your goal is to consider what information your research data contain and whether that information, either in combination with or in addition to the vulnerability of your research subjects, poses elevated risks of harm to them. Information that could increase the risks of harm to your research subjects includes subject matter that is sensitive, taboo or stigmatized in nature. The risk of harm is elevated if the research subjects would experience further risks to their well-being, beyond the disclosure of their vulnerability characteristics, if the information is made public. For example, a dataset containing the names of LGBTQ research subjects is only high risk (orange data) as long as the only information leaked is the research subjects’ names. Such a situation is not ideal, but no further information beyond names has been disclosed. However, if the leaked information included details about the research subjects’ sexuality and their sexual experiences, that would constitute an elevated risk of harm due to the information contained in the data (i.e. very high risk or red data). If the leaked data included just names, but it was known that the research population being studied was entirely LGBTQ, this may or may not constitute an elevated risk of harm, i.e. the data may be orange or red: the data are more likely orange if your population consists of openly queer research subjects, whereas it is safer to consider the data red if your subjects are still in the closet. Because of these nuances, it is imperative that you thoroughly consider how much damage the public disclosure of your research data could cause your research subjects.
  • Some other topics that may increase the risk of harm to your subjects include:
    • employment status
    • income and other financial data
    • student grades and performance
    • location data

The risk of harm in these cases will often depend on the vulnerability characteristics of the research subjects; for example, the employment status of the average Dutch person is somewhat sensitive but does not pose an elevated risk, while the employment status of an asylum seeker could have political ramifications for the research subject in question and may put their safety and well-being at risk.

  • If the data are “special” under the GDPR, the risk of harm may be elevated, but this is not always the case; if your research data fall into the “special” category, treat it as a warning sign to thoroughly assess the elevated risks of harm posed by the data.
    • “Special” data are subject to additional legal requirements under the GDPR, regardless of the risks of harm posed by these data. More information can be found on the GDPR Take-Home Points page.
    • If your data pose additional risks to the research subjects, but are not “special” data types under the GDPR, they do not need to meet the additional legal requirements for “special” data. However, these data should of course be treated with extra care because of the risk to the research subjects’ privacy and well-being.
  • Elevated risks of harm can also apply to non-personal data, such as a company’s financial records. If the disclosure of this information could negatively impact an individual or organisation, this suggests a potentially elevated confidentiality risk.



  1. The legal definition of pseudonymization in the GDPR is quite strict. Essentially, according to the GDPR, pseudonymous data can only be re-identified with the use of additional information (such as a key file), and if that additional information is deleted, the data become anonymous. In practice, the situation is more complicated. A dataset with no directly identifying data may still contain indirectly identifiable data (often demographic information) that can be used to single out unique records, which could then be used to re-identify people using publicly available information or context clues. You could de-identify this dataset further and, if done correctly, ensure that the only way to re-identify the research subjects is with an identification code and a key file. Both versions of this dataset would be called “pseudonymized” by a layperson, but under the GDPR only the latter is legally considered pseudonymized. The main takeaway is that even if someone says their data are pseudonymous, you should investigate to what extent the data have been pseudonymized: do they mean the GDPR’s strict definition, or do they simply mean that the directly identifying data have been removed from the dataset?
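
To make the distinction in this footnote concrete, the sketch below shows two levels of de-identification on a hypothetical dataset: first only dropping the directly identifying columns (what a layperson might call pseudonymization), then replacing them with a random participant code whose link to the identities is kept in a separate key file (closer to the GDPR’s strict definition). The column and file names are illustrative only, and this is not a complete anonymization recipe.

    # Hedged sketch: two levels of de-identification on a hypothetical dataset.
    import secrets
    import pandas as pd

    df = pd.read_csv("raw_survey_data.csv")  # hypothetical raw data with "name" and "email" columns

    # Level 1 ("layperson" pseudonymization): drop the directly identifying columns.
    # The remaining demographic variables may still allow re-identification.
    level1 = df.drop(columns=["name", "email"])

    # Level 2 (closer to the GDPR's definition): give each record a random participant
    # code and keep the link between code and identity in a separate key file.
    df["participant_id"] = [secrets.token_hex(4) for _ in range(len(df))]

    key_file = df[["participant_id", "name", "email"]]
    key_file.to_csv("key_file.csv", index=False)  # store separately and securely

    pseudonymized = df.drop(columns=["name", "email"])
    pseudonymized.to_csv("pseudonymized_data.csv", index=False)

    # If the key file is later deleted (and no other route to re-identification exists),
    # the pseudonymized file could be considered anonymous in the sense described above.

In both versions the data remain personal data under the GDPR as long as re-identification is still possible; only deleting the last link to the identifying information (where possible and appropriate) would change that.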