Anonymization principles

Anonymizing a micro-dataset consists in removing or modifying the identifying variables contained in the dataset. "Typically an identifying variable is one that describes a characteristic of a person that is observable, that is registered (identification numbers, etc.), or generally, that can be known to other persons." (µ Argus Manual)

Identifying variables include:

  • Direct identifiers, which are variables such as names, addresses, or identity card numbers. They permit direct identification of a respondent but are not needed for statistical or research purposes, and should thus be removed from the published dataset.
  • Indirect identifiers, which are characteristics that may be shared by several respondents, and whose combination could lead to the re-identification of one of them. For example, the combination of district of residence, age, sex, and profession would be identifying if only one individual of that particular sex, age, and profession lived in that particular district. Such variables are needed for statistical purposes, and should thus not be removed from the published data files. Anonymizing the data therefore involves two steps:
    1. determining which variables are potential identifiers (this relies on personal judgement); and
    2. modifying the level of precision of these variables to reduce the risk of re-identification to an acceptable level. The challenge is to maximize security while minimizing the resulting information loss.
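
The two steps can be illustrated with a minimal Python sketch. It assumes a small, entirely hypothetical set of records and treats district, age, sex, and profession as the indirect identifiers. The function names (key_counts, unique_records, recode_age) are illustrative and do not come from any particular tool. The sketch counts how many records share each combination of identifier values, then reduces the precision of age by recoding it into 10-year bands and counts again. A record that is unique in a sample is not necessarily unique in the population, so this count is only a rough indication of risk.

```python
from collections import Counter

# Hypothetical micro-dataset with direct identifiers already removed.
records = [
    {"district": "North", "age": 34, "sex": "F", "profession": "teacher"},
    {"district": "North", "age": 36, "sex": "F", "profession": "teacher"},
    {"district": "North", "age": 41, "sex": "M", "profession": "farmer"},
    {"district": "North", "age": 47, "sex": "M", "profession": "farmer"},
    {"district": "South", "age": 52, "sex": "F", "profession": "nurse"},
    {"district": "South", "age": 29, "sex": "M", "profession": "driver"},
]

# Step 1 (a judgement call): the variables assumed to be indirect identifiers.
KEYS = ("district", "age", "sex", "profession")


def key_counts(rows, keys):
    """Count how many records share each combination of key values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_records(rows, keys):
    """Return the records whose key combination occurs exactly once."""
    counts = key_counts(rows, keys)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


def recode_age(row, width=10):
    """Step 2: reduce precision by replacing exact age with an age band."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


uniques_before = unique_records(records, KEYS)
recoded = [recode_age(row) for row in records]
uniques_after = unique_records(recoded, KEYS)

print(len(uniques_before), "unique combinations before recoding")
print(len(uniques_after), "unique combinations after recoding")
```

In this toy dataset every record is unique on exact age, while only two remain unique once age is banded. The price is that analyses requiring exact age are no longer possible, which is the information loss referred to above.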

Disclosure risk depends not only on the presence of identifying variables in the dataset, but also on:

  • The existence of an intruder, which in turn depends on the potential benefit this intruder would reap from re-identification. For some types of data such as business data, the intruder's motivation can be high. For other types of datasets, like household surveys in developing countries, the motivation would typically be much lower as there is little to gain in re-identifying respondents.
  • What other data are available to the intruder. Often, re-identification is done by matching data from various sources (for example, matching sample survey data with administrative registers).
  • The cost of re-identification. The higher the cost, the lower the net benefit for an intruder.

To account for these various parameters, a disclosure scenario must be defined as a first step in the anonymization process. Scenarios can be classified into two categories:

  • Nosy neighbor scenarios. These scenarios assume that the intruder has enough information about a single unit, or a few units, to recognize them in the data, and that this information stems from the intruder's personal knowledge. In other words, the intruder belongs to the circle of acquaintances of a statistical unit.
  • External archive scenarios. Such scenarios are based on the assumption that the intruder can link records belonging to the distributed dataset to records from another available dataset (or register) which contains direct identifiers. The intruder does so by using identifying variables available in both datasets as merging keys (data matching).
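
An external archive scenario can be sketched in a few lines of Python. The example below assumes two hypothetical datasets: a released survey file without direct identifiers, and an external register that carries names. District, age, and sex are assumed to be present in both and serve as merging keys. All names and values are invented for illustration, and the link function is a simple exact-match join, not a description of any specific record linkage tool.

```python
from collections import defaultdict

# Hypothetical released survey file: no direct identifiers.
released = [
    {"district": "North", "age": 34, "sex": "F", "income": 1200},
    {"district": "North", "age": 41, "sex": "M", "income": 900},
    {"district": "South", "age": 52, "sex": "F", "income": 1500},
]

# Hypothetical external register: contains a direct identifier (name).
register = [
    {"name": "A. Example", "district": "North", "age": 34, "sex": "F"},
    {"name": "B. Example", "district": "North", "age": 41, "sex": "M"},
    {"name": "C. Example", "district": "North", "age": 41, "sex": "M"},
    {"name": "D. Example", "district": "East", "age": 52, "sex": "F"},
]

# Identifying variables assumed to be available in both datasets.
MERGE_KEYS = ("district", "age", "sex")


def link(released_rows, register_rows, keys):
    """Pair each released record with the register names sharing its key values."""
    index = defaultdict(list)
    for row in register_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])
    return [
        (row, index.get(tuple(row[k] for k in keys), []))
        for row in released_rows
    ]


matches = link(released, register, MERGE_KEYS)

# A record is re-identified here only when exactly one register entry matches.
reidentified = [
    (candidates[0], row["income"])
    for row, candidates in matches
    if len(candidates) == 1
]
print(reidentified)
```

Only the first released record is matched to a single name. The second has two candidates in the register and the third has none, so neither can be attributed to one person with this simple exact match.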

Conservative assumptions are often made in order to define a worst-case scenario.

When producing a microdata file, one should always keep the user's perspective in mind. It is fundamental that the released file meet researchers' requirements. Both the information content and the choice of protection methods have to focus as much as possible on users' needs. Knowing which statistical analyses users generally want to perform helps in deciding on the anonymization strategy.