Data about human beings are very rarely anonymous. Even if the data do not contain information such as name or address, the data may still pose privacy risks to the people they are about, especially if it is possible to re-identify individuals within the data you are using for your research.
The table below can be used to determine privacy risk categories for different types of data. There will always be grey areas when assessing privacy risk, particularly when considering the vulnerability of the research subjects, the sensitivity of the information and the re-identifiability of the data. The guidance below cannot capture every possible situation, so treat this advice as a spectrum rather than a strict classification; if in doubt, opt for a higher risk category. Also, not all of the data in your research will have the same level of privacy risk, and the privacy risk of each type of data may change as you clean, recode and modify the data from raw to processed form. You should therefore assess the risks for each separate data asset and consider how each asset’s risk will change over the course of the research life cycle. If some data are higher risk than others, you can store those data separately in a more secure storage option.
Once you’ve determined the privacy risk category for each data asset, you can use these categories to inform your choices about how your data can be de-identified, safely used by students/interns, safely transported physically, securely transferred digitally, and securely stored.
| Privacy Risk | Description | Impact of a Breach | Examples |
|---|---|---|---|
| Very high-risk | • Directly identifying data from vulnerable people about sensitive topics | • Severity of harm to research subjects and/or damage to the reputation of VU Amsterdam would be very high • The likelihood of harm or damages after a breach is very high | • Video interviews with children talking about abuse • Raw transcripts of interviews with refugees talking about their home country • Open text responses (e.g. diary-type feedback) from patients with mental or physical conditions/disabilities • Open text responses or detailed interviews with employees describing their satisfaction with their employer • Raw neuroimages (not de-faced) of vulnerable subjects with serious medical conditions • Genetic data from vulnerable subjects that indicates a risk for disease or disorders |
| High-risk | • Directly identifying data from non-vulnerable people about non-sensitive topics OR • Directly identifying data from vulnerable people about non-sensitive topics OR • Directly identifying data from non-vulnerable people about sensitive topics OR • Data from vulnerable people about sensitive topics that has been made slightly less identifiable by removing easily identifying information (e.g. name, contact information) | • Severity of harm to research subjects and/or damage to the reputation of VU Amsterdam would be high • The likelihood of harm or damages after a breach is moderate to high | • Key files containing names and contact information of research subjects • Data containing date of birth and 6-digit postal code of research subjects • Video observations of children playing • Video observations of team-building activities • Raw neuroimages (not de-faced) of non-vulnerable subjects • Raw questionnaire data about sensitive topics • Raw questionnaire data from vulnerable subjects containing detailed demographic information • Genetic data from non-vulnerable subjects |
| Medium to high risk | • Data from non-vulnerable people about non-sensitive topics that has been made slightly less identifiable by removing easily identifying information (e.g. name, contact information) OR • Data from vulnerable people and/or about sensitive topics that has undergone additional de-identification steps beyond the removal of easily identifying information (e.g. name, contact information) | • Severity of harm to research subjects and/or damage to the reputation of VU Amsterdam would be moderate to high • The likelihood of harm or damages after a breach is moderate to low | • IP- and MAC-addresses of research subjects • Raw questionnaire data from non-vulnerable subjects containing demographic information • Questionnaire data about sensitive topics and/or vulnerable people that has been processed to make re-identification more difficult • Video recordings with faces blurred and voices modified • Transcripts of interviews in which the identifying information is replaced with pseudonyms • Repeated physical measurements on vulnerable subjects that include the dates and times the measurements occurred • De-faced neuroimages of vulnerable people • Extensive kinematic measurements that are used to identify sensitive information such as movement disorders |
| Medium to low risk | • Data from non-vulnerable people about non-sensitive topics that has undergone additional de-identification steps beyond the removal of easily identifying information (e.g. name, contact information) | • Severity and likelihood of harm to research subjects after a breach are low • Damage to the reputation of VU Amsterdam is still possible, but the likelihood is lower and the impact would be less severe | • Data that contain a unique record for at least one research subject, e.g.: • De-faced neuroimages of non-vulnerable people • Extensive kinematic measurements from non-vulnerable subjects • Any other datasets that contain sufficient information to create a unique record for one or more research subjects |
| Little to no risk | • Data that cannot be re-identified whatsoever, regardless of the vulnerability of the research subjects or the sensitivity of the information | • Research subjects will suffer no direct harm and VU Amsterdam will suffer no damages to its reputation | • Highly variable physical measurements, e.g. blood pressure, heart rate, blood glucose, body temperature • Likert scale responses in a questionnaire • Coded qualitative data • Summary statistics |

NB: If your data can still be linked to identifying information (e.g. through participant identifiers that link to a separate key file), the data are not anonymous and therefore not “blue”. Such data would be “green” or “yellow” depending on the sensitivity of the information and the vulnerability of the research subjects. If it is possible and appropriate to delete this last link to the identifying information, then the data can be considered anonymous.
Confidentiality and privacy overlap; however, confidentiality concerns how a data breach would impact our institution, while privacy concerns how a data breach would impact our research subjects. Confidentiality has been taken into consideration in the guidance above, so the privacy risks you determine can also be viewed as confidentiality risks.
If your data are not about human subjects, they may still need to be kept confidential, for example, when working with business secrets or intellectual property. If you are working with a third party, especially a business, they may require you to keep their data confidential. Below are some examples of different confidentiality risks for non-human data:
|  | Red Data | Orange Data | Yellow Data | Blue Data* |
|---|---|---|---|---|
| Confidentiality Risk | Very high-risk | High-risk | Medium-risk | Low-risk |
| Examples | • Data that are classified as “secret” | • Commercially sensitive data • Politically sensitive data • Data subject to non-disclosure agreements | • Patents & other intellectual property • AI algorithms that could benefit other countries • Unpublished research output with novel results • Internal procedures and policies | • Data that can be publicly disclosed |
Vulnerable research subjects are at additional risk of harm (social, physical, emotional, financial) if their personal information is made public. The greater the vulnerability of the research subjects, the greater the potential for serious harm.
Vulnerable research subjects include, but are not limited to, children and other minors, refugees, patients with serious medical conditions, and people with mental or physical conditions/disabilities.
The vulnerability of the research subjects can also depend on the context of the research, e.g. employees in organizational psychology research; students in learning analytics research. These contextual risks can also compound the risks for research subjects with other “typical” vulnerability characteristics, e.g. employees who are immigrants.
Sensitive data include “special” data types that receive extra legal attention under the General Data Protection Regulation (GDPR):

- racial or ethnic origin
- political opinions
- religious or philosophical beliefs
- trade union membership
- genetic data
- biometric data used to uniquely identify a person
- data concerning health
- data concerning a person’s sex life or sexual orientation
Sensitive data also include any information that is considered sensitive by the general public, such as financial information, an employee’s satisfaction with their employer, or measurements that reveal a condition such as a movement disorder.
Data may also be more sensitive because the research subjects are more vulnerable, e.g. a refugee describing their experiences in their home country.
Sensitive data that are not included in the GDPR’s list of “special” data types do not need to meet the additional legal requirements for “special” data; however, all sensitive data should be treated with extra care because of the risk to the research subjects’ privacy.
Data are personal data if they are directly identifying or indirectly identifiable:

- Directly identifying data can identify a person on their own, e.g. names, contact information, facial images or voice recordings.
- Indirectly identifiable data can identify a person when combined with other information, e.g. date of birth together with a postal code, IP- and MAC-addresses, or detailed demographic information.
As long as data are identifiable, they cannot be referred to as anonymous data and the legal rules of the GDPR must be followed. This means the GDPR applies to pseudonymous data.
Not all identifiable data pose the same level of privacy risk. Raw data often pose higher privacy risks because of their greater re-identifiability (e.g. video recordings) and therefore require additional data protection measures (such as high-security storage); as the data are cleaned, recoded and analysed, they become less identifiable (e.g. coded interactions) and require fewer data protection measures. De-identification is therefore an important part of data processing that can be used to protect the privacy of your research participants. In fact, the identifiability of the data is the only factor you can change, since it is not possible to reduce the vulnerability of your research subjects or the sensitivity of your research topic.
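As a concrete illustration of this kind of de-identification during processing, the sketch below (using pandas) removes easily identifying columns from a raw questionnaire table and coarsens the indirect identifiers. The column names and the specific steps (dropping name and e-mail, reducing date of birth to a birth year, truncating the postal code) are assumptions made for the example; which measures are appropriate depends on your own data and your risk assessment.

```python
# Hypothetical de-identification step in a processing pipeline (pandas).
# Column names ("name", "email", "date_of_birth", "postal_code", ...) are
# made up for this example; adapt them to your own raw data.
import pandas as pd

raw = pd.DataFrame({
    "name":          ["A. Jansen", "B. de Vries"],
    "email":         ["a.jansen@example.org", "b.devries@example.org"],
    "date_of_birth": ["1987-03-12", "1990-11-30"],
    "postal_code":   ["1081 HV", "1012 AB"],   # 6-digit Dutch postal codes
    "likert_q1":     [4, 2],
})

processed = raw.copy()

# 1. Remove easily identifying information (name, contact information).
processed = processed.drop(columns=["name", "email"])

# 2. Coarsen indirect identifiers: keep only the year of birth and the numeric
#    part of the postal code, so that records are harder to single out.
processed["birth_year"] = pd.to_datetime(processed["date_of_birth"]).dt.year
processed["postal_region"] = processed["postal_code"].str[:4]
processed = processed.drop(columns=["date_of_birth", "postal_code"])

print(processed)
```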
The legal definition of pseudonymization in the GDPR is quite strict. Essentially, according to the GDPR, pseudonymous data can only be re-identified with additional information (such as a key file), and if that additional information is deleted, the data become anonymous. In practice, the situation is more complicated. A dataset with no directly identifying data may still contain indirectly identifiable data (often demographic information) that can be used to single out unique records, which could then be used to re-identify people using publicly available information or context clues. You could de-identify this dataset further and, if done correctly, ensure that the only way to re-identify the research subjects is with an identification code and a key file. Both the former and the latter versions of this dataset would be called “pseudonymized” by a layperson, but under the GDPR only the latter version is legally considered pseudonymized. The main takeaway is that even if someone says their data are pseudonymous, you should investigate to what extent the data have been pseudonymized: do they mean the GDPR’s strict definition, or do they simply mean that the directly identifying data have been removed from the dataset?
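To make the distinction concrete, the sketch below shows pseudonymization in the layperson’s sense: direct identifiers are moved to a separate key file and replaced with a participant code, after which the remaining quasi-identifiers are checked for records that can still be singled out. The column names, the participant codes and the choice of quasi-identifiers are assumptions for illustration; passing such a uniqueness check does not by itself make the data pseudonymous in the strict GDPR sense, let alone anonymous.

```python
# Hypothetical sketch of pseudonymization with a key file (pandas).
# Column names and the choice of quasi-identifiers are assumptions for
# illustration only.
import pandas as pd

data = pd.DataFrame({
    "name":       ["A. Jansen", "B. de Vries", "C. Smit"],
    "birth_year": [1987, 1990, 1990],
    "occupation": ["teacher", "nurse", "nurse"],
    "score":      [12, 9, 15],
})

# 1. Replace direct identifiers with a participant code and store the link
#    in a separate key file (kept in more secure storage than the data).
data["participant_id"] = [f"P{i:03d}" for i in range(1, len(data) + 1)]
key_file = data[["participant_id", "name"]]
pseudonymized = data.drop(columns=["name"])

# 2. Check whether the remaining quasi-identifiers still single out unique records.
quasi_identifiers = ["birth_year", "occupation"]
group_sizes = pseudonymized.groupby(quasi_identifiers)["participant_id"].transform("size")
unique_records = pseudonymized[group_sizes == 1]

print(f"{len(unique_records)} record(s) can still be singled out:")
print(unique_records)
# Deleting the key file removes the *coded* link, but these unique records
# may still be re-identifiable from context, so the data are not anonymous.
```

In this toy example, one record remains unique on birth year and occupation, which is exactly the situation described above: removing the directly identifying column alone does not rule out re-identification.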