Anonymization techniques

Statistical disclosure limitation methods can be classified into two categories:

  • Methods based on data reduction. Such methods aim at increasing the number of individuals in the sample/population that share the same or similar identifying characteristics as the statistical unit under investigation. Such procedures tend to avoid the presence of unique or rare recognizable individuals.
  • Methods based on data perturbation. Such methods achieve data protection in two ways. First, if the data are modified, re-identification by means of record linkage or matching algorithms is harder and more uncertain. Second, even when an intruder is able to re-identify a unit, they cannot be confident that the disclosed data are consistent with the original data.

An alternative solution consists in generating synthetic microdata.

Removing variables

The first obvious application of this method is the removal of direct identifiers from the data file. A variable should be removed when it is highly identifying and no other protection method can be applied. A variable can also be removed when it is too sensitive for public use or irrelevant for analytical purposes. For example, information on race, religion or HIV status might be withheld from a public use file but released in a licensed file.
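
As a minimal sketch of removing variables, the snippet below drops a set of direct identifiers from every record. The column names and the helper remove_variables are hypothetical and chosen only for illustration.

```python
# Hypothetical direct identifiers; real files will use other names.
DIRECT_IDENTIFIERS = {"name", "national_id", "phone"}

def remove_variables(records, to_remove):
    """Return copies of the records without the variables in `to_remove`."""
    return [{k: v for k, v in r.items() if k not in to_remove} for r in records]

records = [
    {"name": "A. Smith", "national_id": "X1", "phone": "555-0100", "age": 34},
    {"name": "B. Jones", "national_id": "X2", "phone": "555-0101", "age": 51},
]
public_file = remove_variables(records, DIRECT_IDENTIFIERS)
```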

Removing records

Removing records can be adopted as an extreme measure of data protection when the unit is identifiable in spite of the application of other protection techniques. For example, in an enterprise survey dataset, a given enterprise may be the only one belonging to a specific industry. In this case, it may be preferable to remove this particular record rather than removing the variable "industry" from all records. Because it strongly affects the statistical properties of the released data, removing records should be avoided as far as possible.

When the records to be removed are selected according to a sampling design, the method is referred to as sub-sampling (or sampling, when the original matrix represents census data).
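
Sub-sampling can be sketched as follows. The text does not prescribe a sampling design, so simple random sampling without replacement is assumed here; the function name subsample and the fraction parameter are illustrative.

```python
import random

def subsample(records, fraction, rng):
    """Keep a simple random sample (without replacement) of the records."""
    size = round(len(records) * fraction)
    # Sample indices and sort them so the released file keeps the original order.
    keep = sorted(rng.sample(range(len(records)), size))
    return [records[i] for i in keep]

records = [{"id": i} for i in range(10)]
released = subsample(records, fraction=0.5, rng=random.Random(0))
```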

Global recoding

The global recoding method consists in aggregating the values observed in a variable into pre-defined classes (for example, recoding age into five-year age groups, or the number of employees into three size classes: small, medium and large). The method applies to numerical variables, continuous or discrete. It affects all records in the data file.

When dealing with categorical variables (or categorized numerical variables), the global recoding method collapses similar or adjacent categories.

Consider, for example, the variable "Marital status" that is often observed in the following categories: Single, Married, Separated, Divorced, Widow. The sample frequency of the Separated category might be low, especially when cross-tabulated with other variables. The two adjacent categories Separated and Divorced can be joined into a single one "Separated or Divorced". The observed frequencies of the combinations involving this new category would be higher than those relative to Separated and Divorced separately. The categories to be joined are chosen considering the data utility as well as the statistical control of the frequencies.

The method can also be applied to key variables (such as geographic codes) to reduce their identifying effect.
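
The two forms of global recoding described above can be sketched as follows: one helper bins a numerical variable into fixed-width classes, the other collapses adjacent categories. The helper names and the class labels are assumptions made for this illustration.

```python
def recode_age(age, width=5):
    """Map an exact age to a fixed-width class label such as '30-34'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

# Hypothetical mapping that collapses two adjacent categories into one.
MARITAL_RECODE = {
    "Separated": "Separated or Divorced",
    "Divorced": "Separated or Divorced",
}

def recode_marital(status):
    """Collapse Separated and Divorced; leave other categories unchanged."""
    return MARITAL_RECODE.get(status, status)

records = [
    {"age": 17, "marital": "Single"},
    {"age": 33, "marital": "Separated"},
    {"age": 34, "marital": "Divorced"},
]
# Global recoding is applied to every record in the file.
recoded = [
    {"age": recode_age(r["age"]), "marital": recode_marital(r["marital"])}
    for r in records
]
```

After recoding, the second and third records share the same combination of values, so the frequency of that combination rises from one to two.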

Top and bottom coding

Top and bottom coding can be regarded as a special case of global recoding that can be applied to numerical or ordinal categorical variables. The variables "Salary" and "Age" are two typical examples. The highest values of these variables are usually very rare and therefore identifiable. Top coding at certain thresholds introduces new categories such as "monthly salary higher than 6000 dollars" or "age higher than 75", leaving the other observed values unchanged. The same reasoning applied to the smallest observed values defines bottom coding. When dealing with ordinal categorical variables, a top (or bottom) category is defined by aggregating the "highest" (or "lowest") categories.
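
A minimal sketch of top and bottom coding for a numerical variable is given below. The thresholds and the label format are illustrative assumptions; values within the thresholds are left unchanged.

```python
def top_bottom_code(value, bottom=None, top=None):
    """Return a category label for values beyond a threshold, else the value."""
    if top is not None and value > top:
        return f">{top}"
    if bottom is not None and value < bottom:
        return f"<{bottom}"
    return value

ages = [16, 42, 75, 76, 93]
coded = [top_bottom_code(a, bottom=18, top=75) for a in ages]
```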

Local suppression

Local suppression consists in replacing the observed value of one or more variables in a certain record with a missing value. It is particularly suitable for categorical key variables, where the risk arises from combinations of values on those variables. In this case, local suppression replaces an observed value in a combination with a missing value. The aim of the method is to reduce the information content of rare combinations. The result is an increase in the frequency count of records containing the same (modified) combination. For example, suppose the combination "Marital status=Widow; Age=17" is a population unique. If the information on Age is suppressed, the combination "Marital status=Widow; Age=missing" is no longer identifying. Alternatively, one can decide to suppress the information on Marital status instead.

A criterion is therefore necessary to decide which variable in the risky combinations has to be locally suppressed. The main criterion is to minimize the number of local suppressions. For example, consider the values of the key variables "Sex=Female; Marital status=Widow; Age=17; Occupation=Student" observed in a unit. Both the combinations "Marital status=Widow; Age=17" and "Sex=Female; Marital status=Widow; Occupation=Student" characterize the unit and may be population unique (combinations at risk). In order to minimize the number of local suppressions, one can choose to replace the value of Marital status with a missing value. By doing so, both combinations are protected simultaneously with a single local suppression. If the combinations were considered independently, two local suppressions could be required. Another criterion can be defined according to a measure of information loss (for example, the value minimizing an entropy indicator might be selected for local suppression). Moreover, suppression weights can be assigned to the key variables in order to steer local suppression towards less important variables.

Local suppression also requires a selection criterion for the records. In the previous paragraph, several rules defining a record at risk have been presented. Local suppression could be applied only to risky records (records that contain combinations at risk).
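
The criterion of minimizing the number of suppressions can be sketched with a simple greedy rule: repeatedly suppress the variable that appears in the largest number of still-unprotected risky combinations. This is only a heuristic under stated assumptions (it does not guarantee the minimum in general, and it ignores suppression weights and information loss); the function name and the use of None for a missing value are illustrative.

```python
from collections import Counter

def suppress_locally(record, risky_combinations):
    """Greedily blank out the variables that cover the most risky combinations."""
    protected = dict(record)
    remaining = [set(combo) for combo in risky_combinations]
    while remaining:
        # Count how many unprotected combinations each variable takes part in.
        counts = Counter(v for combo in remaining for v in sorted(combo))
        variable, _ = counts.most_common(1)[0]
        protected[variable] = None  # None stands for a missing value
        # A combination is protected once one of its variables is suppressed.
        remaining = [combo for combo in remaining if variable not in combo]
    return protected

record = {"Sex": "Female", "Marital status": "Widow", "Age": 17, "Occupation": "Student"}
risky = [("Marital status", "Age"), ("Sex", "Marital status", "Occupation")]
protected = suppress_locally(record, risky)
```

For the record above, Marital status belongs to both risky combinations, so a single suppression protects both.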

Micro-aggregation

Micro-aggregation is a perturbation technique first proposed by Eurostat as a statistical disclosure control method for numerical variables. The idea is to replace an observed value with the average computed on a small group of units (small aggregate or micro-aggregate), including the investigated one. The units belonging to the same group are represented in the released file by the same value. Each group contains at least a predefined number k of units. The minimum accepted value of k is 3. For a given k, the problem is to find the partition of the whole set of units into groups of at least k units (k-partition) that minimizes the information loss, usually expressed as a loss of variability. The groups are therefore constructed according to a criterion of maximum similarity between units. The micro-aggregation mechanism achieves data protection by ensuring that there are at least k units with the same value in the data file.

When micro-aggregation is independently applied to a set of variables, the method is called individual ranking. When all the variables are averaged at the same time for each group, the method is called multivariate micro‑aggregation.

The easiest way to group records before aggregating them is to sort the units according to a similarity criterion and to aggregate consecutive units into fixed-size groups. A size adjustment may be required for the first or last group. For univariate micro-aggregation, the sorting criterion may be the variable itself.
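
Fixed-size univariate micro-aggregation can be sketched as follows, sorting on the variable itself. Letting the last group absorb the leftover units is one possible size adjustment, chosen here as an assumption; the function name is illustrative.

```python
from statistics import mean

def microaggregate(values, k=3):
    """Fixed-size univariate micro-aggregation; the last group absorbs leftovers."""
    if len(values) < k:
        raise ValueError("need at least k values")
    # Indices of the units, sorted by the value of the variable.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = len(values) // k
    result = [None] * len(values)
    for g in range(n_groups):
        start = g * k
        end = len(values) if g == n_groups - 1 else start + k
        group = order[start:end]
        average = mean(values[i] for i in group)
        for i in group:
            result[i] = average  # every unit in the group gets the group mean
    return result

incomes = [1200, 5400, 1500, 900, 5100, 6000, 2500]
masked = microaggregate(incomes, k=3)
```

With seven units and k = 3, the three smallest incomes form one group and the remaining four form the other. The overall mean of the variable is preserved, while the variability within each group is lost.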

For multivariate micro-aggregation, the sorting criterion can be one of the observed variables or, to increase the effectiveness of the method, a combination of variables. For example, the first principal component or the sum of the Z-scores across the set of variables has been proposed as a sorting criterion for fixed-size micro-aggregation.

Multivariate micro-aggregation is considered much more protective than individual ranking because the method guarantees that at least k units in the file are identical (all the variables are averaged at the same time), but the information loss is higher.
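
A sketch of fixed-size multivariate micro-aggregation is given below, using the sum of Z-scores as the sorting criterion. The use of the population standard deviation, the guard for constant variables and the function names are assumptions made for this illustration.

```python
from statistics import mean, pstdev

def z_score_sum(rows):
    """Sum of per-variable Z-scores for each row (the sorting criterion)."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return [sum((x - m) / s for x, m, s in zip(row, mus, sigmas)) for row in rows]

def multivariate_microaggregate(rows, k=3):
    """Replace each row by the centroid of its group of at least k rows."""
    if len(rows) < k:
        raise ValueError("need at least k rows")
    scores = z_score_sum(rows)
    order = sorted(range(len(rows)), key=lambda i: scores[i])
    n_groups = len(rows) // k
    result = [None] * len(rows)
    for g in range(n_groups):
        start = g * k
        end = len(rows) if g == n_groups - 1 else start + k
        group = order[start:end]
        # All variables are averaged at the same time within the group.
        centroid = tuple(mean(rows[i][j] for i in group) for j in range(len(rows[0])))
        for i in group:
            result[i] = centroid
    return result

rows = [(10, 1), (12, 2), (11, 1), (50, 9), (52, 8), (51, 9)]
masked_rows = multivariate_microaggregate(rows, k=3)
```

Every released row is shared by at least k units, which is the property that makes this variant more protective than individual ranking.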

Data swapping

Data swapping was initially proposed as a perturbation technique for categorical microdata, aimed at protecting tabulations stemming from the perturbed microdata file. Data swapping consists in altering a proportion of the records in a file by swapping values of a subset of variables between selected pairs of records (swap pairs).

The level of data protection depends on the perturbation level induced in the data. A criterion needs to be applied to determine which variables and which records (the swapping rate) have to be swapped. For categorical data, swapping is frequently applied to records that are sample unique or sample rare, as these records usually present higher risks of re-identification.
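
The mechanics of swapping can be sketched as follows for a single variable. The rule used to form swap pairs (random pairing of the candidate records) is an assumption; in practice the candidates would be chosen by a risk criterion such as sample uniqueness.

```python
import random

def swap_values(records, variable, candidates, rng):
    """Swap `variable` between randomly formed pairs of candidate record indices."""
    swapped = [dict(r) for r in records]
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    # Pair consecutive indices; with an odd count the last record is left as is.
    for a, b in zip(shuffled[0::2], shuffled[1::2]):
        swapped[a][variable], swapped[b][variable] = (
            swapped[b][variable],
            swapped[a][variable],
        )
    return swapped

records = [
    {"id": 0, "region": "North"}, {"id": 1, "region": "South"},
    {"id": 2, "region": "East"}, {"id": 3, "region": "West"},
    {"id": 4, "region": "North"}, {"id": 5, "region": "South"},
]
risky_indices = [0, 2, 3, 5]  # hypothetical records flagged as sample unique
swapped = swap_values(records, "region", risky_indices, random.Random(0))
```

Because values are only exchanged between records, the univariate frequencies of the swapped variable are unchanged, while its joint distribution with the other variables is perturbed.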

Finding data swaps that provide adequate protection while preserving the exact statistics of the original database is impractical. Even when the univariate moments are maintained, data swapping generally modifies the data too much.

Post-randomization (PRAM)

As a statistical disclosure control technique, post-randomization (PRAM) induces uncertainty in the values of some variables by changing them according to a probabilistic mechanism. PRAM can therefore be considered a randomized version of data swapping. As with data swapping, data protection is achieved because an intruder cannot be confident that a given released value is true, and therefore matching the record with external identifiers can "easily" lead to a mismatch or to attribute misclassification. The method was introduced for categorical variables, but it can be generalized to numerical variables as well.
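
The probabilistic mechanism can be sketched with a transition matrix that gives, for each original category, the probability of releasing each category. The matrix below and the function name pram are hypothetical; the text does not specify how the probabilities should be chosen.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[original][released].
TRANSITIONS = {
    "Single":  {"Single": 0.90, "Married": 0.05, "Widow": 0.05},
    "Married": {"Single": 0.05, "Married": 0.90, "Widow": 0.05},
    "Widow":   {"Single": 0.05, "Married": 0.05, "Widow": 0.90},
}

def pram(values, transitions, rng):
    """Draw each released value from the row of its original category."""
    released = []
    for value in values:
        categories = list(transitions[value])
        weights = [transitions[value][c] for c in categories]
        released.append(rng.choices(categories, weights=weights, k=1)[0])
    return released

values = ["Single", "Married", "Widow", "Married", "Single"]
released = pram(values, TRANSITIONS, random.Random(0))
```

Each row of the matrix must sum to one. Larger diagonal entries keep more values unchanged and therefore preserve more of the original information.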

Adding noise

Adding noise consists in adding a random value ε, with zero mean and predefined variance σ², to all values in the variable to be protected. Generally, methods based on adding noise are not considered very effective in terms of data protection.
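
A minimal sketch of additive noise is shown below. The text only fixes the mean and variance of ε, so the normal distribution is an assumption here, as are the function name and the independent draw for each value.

```python
import random

def add_noise(values, sigma, rng):
    """Add independent noise with mean 0 and standard deviation sigma."""
    # Normally distributed noise is an assumption; only mean and variance are given.
    return [x + rng.gauss(0.0, sigma) for x in values]

salaries = [2100.0, 3400.0, 2950.0, 6100.0]
noisy = add_noise(salaries, sigma=100.0, rng=random.Random(0))
```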

Resampling

Resampling is a protection method for numerical microdata that consists in drawing, with replacement, t samples of n values from the original data, sorting each sample, and averaging the values of equal rank across the samples. The level of data protection guaranteed by this procedure is generally considered quite low.
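
The procedure can be sketched as follows for a single variable. Assigning the j-th averaged value to the unit holding the j-th smallest original value is an assumption about how the averages are mapped back to records; the function name is illustrative.

```python
import random
from statistics import mean

def resample_mask(values, t, rng):
    """Mask values by averaging equally ranked values of t bootstrap samples."""
    n = len(values)
    # Draw t samples of size n with replacement and sort each of them.
    samples = [sorted(rng.choices(values, k=n)) for _ in range(t)]
    # Average the j-th smallest values across the t samples.
    rank_means = [mean(sample[j] for sample in samples) for j in range(n)]
    # Assumed mapping: the unit with the j-th smallest value gets the j-th mean.
    order = sorted(range(n), key=lambda i: values[i])
    masked = [None] * n
    for rank, i in enumerate(order):
        masked[i] = rank_means[rank]
    return masked

values = [23, 45, 31, 67, 52, 38]
masked = resample_mask(values, t=3, rng=random.Random(0))
```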

Synthetic microdata

Synthetic microdata are an alternative approach to data protection and are produced by data simulation algorithms. The rationale for this approach is that synthetic data do not pose problems with regard to statistical disclosure control because they do not contain real data, yet they preserve certain statistical properties. Initially, Rubin proposed synthetic data generation through multiple imputation, while Fienberg proposed the use of bootstrap methods. Other approaches have since been suggested, such as Latin hypercube sampling, model-based generation, and data distortion by probability distribution.

Generally, users are not keen to work with synthetic data, as they cannot be confident in the results of their statistical analysis. Nevertheless, this approach can help in producing “test microdata sets”. In this case, synthetic data files would be released to allow users to test their statistical procedures before they access the “true” microdata in a data enclave.
