You will see many different terms used to describe the process of making it harder or impossible to identify a specific person within your data: anonymization, pseudonymization, de-identification, etc. The definitions of these terms can vary depending on context and opinion. This guide uses the term de-identification and focuses mainly on making it as hard as possible to re-identify people in the data. It’s also important to be aware that it isn’t always feasible or appropriate to de-identify data to the point where re-identification is impossible: you might be able to achieve anonymous data this way, but in doing so you may modify your data so much that you render them useless.
If de-identification doesn’t necessarily create data that are impossible to re-identify, why do it? Because if you de-identify your data you can lower the privacy risk categorization of your data, which gives you more choice in determining:
There are other advantages such as helping to maintain blinding in an RCT and, obviously, safeguarding the privacy of research subjects, which helps maintain public trust.
The following steps serve as general guidance on how to de-identify your data. Different types of data and different purposes will affect how you carry out de-identification in your research, but you can use this guidance to assist you in the process. Additionally:
Many of the following steps will involve modifying your data, e.g. deleting certain variables, deleting unnecessary datasets, modifying categories, removing outliers etc. Before permanently modifying your data, consider whether you need to maintain a copy of the raw, unaltered data. This is most relevant for data that will be used in your analysis (as opposed to data that are just for communication with participants), especially if those raw data cannot be easily replicated. Ask yourself “Would someone need to review these data to be able to confirm my findings in the future?”. Also note that if you are conducting WMO-applicable research and/or research that falls under the Good Clinical Practice guidelines, there are stricter requirements for maintaining your data; if these laws/guidelines apply to your research, check whether or not it’s appropriate to permanently modify any of your data (including data necessary for communication with participants).
Make sure to document the steps you take to create your de-identified datasets: this documentation should, preferably, be in the form of a syntax or other programming code, but at a minimum you should document all modifications in a logbook.
Your research goals and the nature of your data may influence how far you can go in the de-identification process. You may only be able to complete steps 1-4; perhaps even that is not feasible. Go as far as you can in the de-identification process and once you’ve reached the endpoint that is feasible for your research, reassess the privacy risks posed by your data.
If you haven’t done so already, stop what you’re doing and go start your data management plan. You can use DMPonline to write a DMP; it’s designed to guide you through the process. You don’t need to complete an entire plan to figure out how to de-identify your data; you do, however, need to understand the nature of the data assets you are using in your research, especially what these data will look like at the start of your research, as well as the form the data will ultimately need to take for you to carry out your analyses.
Some forms of raw data assets used at the start of research are:
Once you have an idea of what your data assets are, consider the variables that you plan to collect for each data asset and determine whether any variables are direct identifiers. Also assess whether a direct identifier will be used in your analysis. For example, names and contact information, which are direct identifiers, are usually collected because you need to stay in touch with participants, but these variables don’t usually need to be used in the analysis. In contrast, facial images in video recordings, which are also direct identifiers, may actually be essential for your analysis. Also consider whether the direct identifiers you are planning to collect even need to be collected for you to conduct your research: ask yourself whether you could carry out your research without collecting some or all of these direct identifiers.
Consider whether or not you need to collect direct identifiers simultaneously with the other research data. For example, can you carry out one process that collects the direct identifiers and a separate process that collects the data you need for your research? Considerations for how you can apply this concept are shown below for a variety of data types:
NB: It’s strongly recommended to avoid collecting direct identifiers alongside your other data, but sometimes this cannot be avoided (e.g. recording facial images is necessary to observe and analyse facial expressions or personal interactions). At a minimum, take the time to assess whether you are collecting direct identifiers unnecessarily or whether there are alternative means to collect directly identifying information so that it can be kept separate from your other research data at the outset.
Is it feasible to record only audio rather than audio and video?
Can you name audiovisual files (e.g. recordings of interviews) with a participant ID number rather than a participant name?
Is it feasible not to discuss specific names and places while recording an interview?
Is it feasible to avoid recording faces?
Is it feasible to record measurements while only identifying the participant with a participant ID number?
Will your method of collecting data collect direct identifiers that you don’t need (e.g. location data from a smart device)? If so, what can you do to avoid this?
Is it feasible to collect questionnaire responses without also asking for name and contact information in the same questionnaire?
Is it absolutely necessary to have long open text fields in your questionnaire, such as a “Comments” section?
If you already have the e-mail addresses of your participants, is it feasible to invite them to participate in the questionnaire without uploading those e-mail addresses to the survey tool?
Will the survey tool collect direct identifiers that you don’t need (e.g. IP-addresses)? If so, what can you do to avoid this?
If step 3 is not feasible for some or all of the direct identifiers, then the directly identifying variables should be separated from the research data after data collection is complete (remember that this only applies to direct identifiers that are not required for your analyses). If you need direct identifiers to keep track of who your participants are, you should use these identifiers to create a key file that you can link to the research data with a random participant identification number. This key file should be kept separate from your other research data, either by storing it on a different storage solution or, if on the same storage solution, in a separate, encrypted folder that can only be opened by those who absolutely require access.
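The separation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `split_identifiers`, the column names, the `P` prefix and the example records are all hypothetical, and the sketch assumes your data can be read as one record per participant. The participant ID is drawn at random so that it carries no information about the person.

```python
import secrets


def split_identifiers(records, direct_identifiers):
    """Split raw records into key-file rows and de-identified research rows."""
    key_rows, research_rows = [], []
    used_ids = set()
    for record in records:
        # Draw a random ID that is not derived from the name, date of birth
        # or order of enrolment; retry in the unlikely case of a collision.
        participant_id = "P" + secrets.token_hex(4)
        while participant_id in used_ids:
            participant_id = "P" + secrets.token_hex(4)
        used_ids.add(participant_id)

        key_row = {"participant_id": participant_id}
        research_row = {"participant_id": participant_id}
        for column, value in record.items():
            if column in direct_identifiers:
                key_row[column] = value       # goes to the key file only
            else:
                research_row[column] = value  # stays in the research data
        key_rows.append(key_row)
        research_rows.append(research_row)
    return key_rows, research_rows


# Hypothetical example records
raw = [
    {"name": "A. Example", "email": "a@example.org", "systolic_bp": 128},
    {"name": "B. Example", "email": "b@example.org", "systolic_bp": 117},
]
key_file, research_data = split_identifiers(raw, {"name", "email"})
```

In this sketch `key_file` and `research_data` share only the random participant ID. Writing the two lists to separate storage locations, and encrypting the key file, is left to the tools available at your institution.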
NB: If your data contain direct identifiers that are necessary for your analysis, the following steps are still useful. In step 4, you can still consider how to separate those direct identifiers that are not required for your analysis and in step 5, you can apply methods to make the indirectly identifying information in your data less identifying, if appropriate. Just remember that when you assess the privacy risks after de-identification is done, you still have some directly identifying information present that will impact these risks.
If for some reason the files for audiovisual recordings needed to temporarily be named based on directly identifying information about the participant, change those file names as soon as possible to the participant ID number.
If appropriate for your research, make a transcript of the audiovisual data and then follow the recommendations for textual data.
If appropriate for your research, blur facial images and use voice modification.
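Renaming recordings to the participant ID number can be scripted so that the change is documented and repeatable. The following is a minimal sketch under stated assumptions: the helper names `plan_renames` and `apply_renames`, the file name and the ID `P0001` are hypothetical, and the mapping from old file names to participant IDs is assumed to come from your key file (the mapping is itself directly identifying and should be stored with the key file).

```python
import tempfile
from pathlib import Path


def plan_renames(folder, name_to_id):
    """Return (old_path, new_path) pairs without renaming anything yet."""
    plan = []
    for path in sorted(Path(folder).iterdir()):
        participant_id = name_to_id.get(path.stem)
        if participant_id is not None:
            plan.append((path, path.with_name(participant_id + path.suffix)))
    return plan


def apply_renames(plan):
    """Carry out the planned renames, refusing to overwrite existing files."""
    for old_path, new_path in plan:
        if new_path.exists():
            raise FileExistsError(f"{new_path} already exists")
        old_path.rename(new_path)


# Demonstration in a temporary folder with a hypothetical recording
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "jan_jansen.wav").touch()
    plan = plan_renames(tmp, {"jan_jansen": "P0001"})
    apply_renames(plan)
    renamed = sorted(p.name for p in Path(tmp).iterdir())
```

Splitting the work into a planning step and an applying step lets you review the planned changes before any file name is altered.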
If your research requires that you be able to re-identify your participants for follow-up and communication purposes, the consent form is your best place to collect the information for this purpose. Whether you are using paper or digital consent forms, you can use the information collected via these forms to create your key file. Your key file can also include details about what the participant consented to (e.g. future follow-up research, sharing of their data, optional sub-studies etc.) so that you can keep track of this information in the future.
If you determined in step 3 that you don’t need to collect directly identifying information with your consent forms for your research, then you don’t necessarily need to create a key file. The only reason you may want to do so is if you ask for consent to additional purposes beyond your current research. For example, if your subjects were asked to consent to the sharing of their data for future research and some participants don’t consent to this, then you need to keep track of which research data belong to those participants so that their data are not shared for new research. You could therefore create a file of participant identification numbers and the various conditions they did or did not consent to.
If for some reason the imaging files needed to temporarily be named based on directly identifying information about the participant, change those file names as soon as possible to the participant ID number.
If it couldn’t be avoided that direct identifiers were included in the image, remove this information from the imaging data and update the key file with this information as needed.
If you are collecting neuroimaging data such as fMRI scans, deface the images.
If you need to collect questionnaire data that includes direct identifiers, export the data from the survey tool and review the directly identifying variables for anything new that should be added to the key file. Once you’ve updated the key file, you should delete the directly identifying variables from the exported questionnaire data. Remember that this does not apply to any directly identifying variables that you plan to use in your analyses; any information used for your analyses needs to be maintained for research integrity purposes.
Take extra care to review open text fields, particularly “Comments” fields. Look for any self-reporting from the participant regarding their name, changes to their address etc. Depending on how you are using your “Comments” sections, you may want to simply copy the information provided by the participant over to the key file, thereby creating a comments variable in the key file. Alternatively, you may just wish to review the information the participant supplied and update the information already present in the key file. If the comment only consists of information relevant to the key file, you can delete it once the key file has been updated. If the comment also includes anything relevant to your research analyses, then keep the information relevant to your research and remove any of the information that you’ve already included in the key file.
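A simple script can help prioritise which open text fields to review first. The sketch below is only an aid and rests on assumptions: the patterns for e-mail addresses and phone numbers are rough illustrations, the function name `flag_comment` and the example comments are hypothetical, and pattern matching will not detect names, addresses or personal stories. Manual review of every comment is still needed.

```python
import re

# Rough, illustrative patterns; they will miss some cases and flag others wrongly.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def flag_comment(text):
    """Return the sorted names of the patterns found in a comment."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


# Hypothetical comments keyed by participant ID
comments = {
    "P0001": "Please use my new address: j.jansen@example.org",
    "P0002": "The second question was unclear.",
    "P0003": "You can reach me on 06 1234 5678 after six.",
}
to_review = {
    pid: flag_comment(text) for pid, text in comments.items() if flag_comment(text)
}
```

Here `to_review` lists the comments that appear to contain contact details, so that these can be checked against the key file before the text is edited or deleted.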
Carrying out this step will depend on the goals of your research and the statistical testing you aim to do. This means that this step may not be feasible or appropriate for every situation. This step may also only be applicable when you decide you want to share your data with others and therefore need to make it as unidentifiable as possible. Whatever your situation, it is worthwhile to review this step and see if at least some of the suggestions below can be applied to your data to reduce their identifiability.
The goal of this step is to look at your research data and assess whether it would be possible to single out at least one unique record about one of your research subjects:
When doing this assessment, don’t forget to consider all of your data assets. For example, you may have a series of blood pressure measurements (which are generally not identifying), but if these measurements are linked via a participant ID number to a key file or to detailed questionnaire data, then the blood pressure measurements are still identifiable. Additionally, don’t forget that information about your research subjects may be known due to context, such as what you report in your research methods, even if that information isn’t included in the data itself.
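For tabular data, one way to look for records that can be singled out is to count how often each combination of potentially identifying values occurs. The sketch below assumes hypothetical column names (`age_group`, `occupation`) and a hypothetical helper `find_unique_records`. It only covers the variables you pass in, so it cannot account for linked data assets or for information known from context.

```python
from collections import Counter


def find_unique_records(rows, quasi_identifiers):
    """Return rows whose combination of quasi-identifier values occurs only once."""
    def combination(row):
        return tuple(row[column] for column in quasi_identifiers)

    counts = Counter(combination(row) for row in rows)
    return [row for row in rows if counts[combination(row)] == 1]


# Hypothetical example data
rows = [
    {"participant_id": "P0001", "age_group": "30-39", "occupation": "nurse"},
    {"participant_id": "P0002", "age_group": "30-39", "occupation": "nurse"},
    {"participant_id": "P0003", "age_group": "60-69", "occupation": "pilot"},
]
unique = find_unique_records(rows, ["age_group", "occupation"])
```

In this example only the record of `P0003` is returned, because no other participant shares that combination of age group and occupation.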
Before considering the various types of data, you should have an idea of which variables are indirect identifiers. Look at the variables in your dataset and assess what information could be relevant to re-identifying your research subjects. This is not just information like demographics (age, education, occupation, ethnicity etc.), but could also be specific dates, rare medical conditions, details about a specific event etc. Once you’ve determined which variables may be identifying, you need to decide to what extent you can modify the content of each variable while still maintaining research data that are useful for your analysis. Examples are given below for several types of data. You may notice that, for some examples, de-identification happens simultaneously with the data processing you planned to do anyway (e.g. grouping categories with only a few records into larger categories, coding textual information into quantitative variables, coding observations etc.).
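Two of the modifications mentioned above, replacing exact values with ranges and grouping small categories into larger ones, can be sketched as follows. The band width of 10 years, the minimum category size of 2, the label `other` and the helper names `age_band` and `merge_rare` are assumptions chosen for illustration; appropriate values depend on your data and your analysis.

```python
from collections import Counter


def age_band(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def merge_rare(values, min_count, other_label="other"):
    """Replace categories that occur fewer than min_count times with one label."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label for value in values]


# Hypothetical example values
ages = [34, 37, 61]
occupations = ["nurse", "nurse", "pilot"]

banded = [age_band(age) for age in ages]
merged = merge_rare(occupations, min_count=2)
```

Note that a merged category can itself remain small: in this example the single `other` record is still unique, so it is worth re-checking for unique records after each modification.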
Are there potentially indirectly identifying features in the recording that could be blurred, e.g. unique tattoos?
Depending on the nature of the research, can the information in the audiovisual recording be recoded into something less identifiable, but more useful for analysis, e.g. coding of observed interactions between subjects?
Remove any indirectly identifying data included in the image, e.g. age and weight may have automatically been included for calibration purposes. If this information must be maintained for your analysis or for proper interpretation of the data, store this information in a tabular dataset and link the data together via a participant ID number.
Consider whether the data could be spatially normalized and still be useful for analysis.
Is there information present that is unique to the subject, or that could be indirectly identifying when combined with other details or context? If so, can this information be replaced with a pseudonym, as described in the methods for qualitative data?
Can the information be coded into overarching themes rather than highly unique, personal stories?
Can quotations of the data be created that express the relevant themes, but are generic enough to not be identifying?
Once you’ve gone through these steps, you’ve made your data a lot less identifiable and perhaps also a lot easier to analyse. Before assuming that your data are anonymous at this point, it’s important to be aware that for data to be considered anonymous by European legal bodies, all three of the following considerations must apply:
These requirements are pretty difficult to meet. For example, MRI data will always be unique to an individual, unless you spatially normalize it. This doesn’t mean that it’s impossible to achieve anonymous data, but there may be further work required to get to that level and the anonymized data may not be so useful anymore for your planned research analyses.
The most important thing to remember is to check, after you’ve gone through the de-identification steps in this guide, whether the three conditions above apply before assuming that your data are now anonymized.
Lastly, don’t forget that most research data don’t exist in a bubble; you rarely just have imaging data or questionnaire data or physical measurements. The more data you have about a person, the easier they are to re-identify.