INTRODUCTION
The safe dissemination of data is one of the main concerns
of National Statistical Agencies. The released data can be classified
as disaggregated or aggregated. Disaggregated data (a.k.a. microdata
or microfiles) consists of files of records, each record providing
the values for a set of variables of an individual. Aggregated data (a.k.a.
tabular data) is obtained from microdata crossing two or more variables,
which results in sets of tables with a likely large number of cells.
It must be guaranteed, for both types of data, that no individual information
can be derived from the released information. The available methods for
this purpose belong to the field of statistical disclosure control. Good
introductions to the state-of-the-art in this field can be found in the
monographs [25, 16].
In this project we focus first and mainly on tabular data protection. Although
each cell of a table shows aggregated information for several individuals,
there is still a risk of disclosing individual data. This is clearly shown
in the tables below. Table (a) gives the average salary by age interval
and ZIP code, while table (b) shows the number of individuals for the same
crossing of variables. If there were only one individual in ZIP code
z2 and age interval 51-55, then any external attacker would know that the
salary of this single person is 40000 EUR. If there were two individuals,
either of them could deduce the salary of the other, becoming an internal
attacker. Usually, cells showing information about few individuals are
considered sensitive, although other rules can be used in practice. Methods
for detecting sensitive cells are outside the scope of this project; a recent
discussion of sensitivity rules can be found in [17, 24].
         |  ...  |    z1     |    z2     |  ...
  -------+-------+-----------+-----------+------
   ...   |  ...  |    ...    |    ...    |  ...
   51-55 |  ...  | 38000 EUR | 40000 EUR |  ...
   56-60 |  ...  | 39000 EUR | 42000 EUR |  ...
   ...   |  ...  |    ...    |    ...    |  ...

         (a) Average salary

         |  ...  |  z1  |   z2   |  ...
  -------+-------+------+--------+------
   ...   |  ...  | ...  |  ...   |  ...
   51-55 |  ...  |  20  | 1 or 2 |  ...
   56-60 |  ...  |  30  |   35   |  ...
   ...   |  ...  | ...  |  ...   |  ...

         (b) Number of individuals
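The internal-attacker scenario described above can be made concrete with a short sketch (the salary of the attacker is a hypothetical figure, not from the tables): knowing the published mean and the cell count, one of the two individuals in the cell recovers the other's exact value.

```python
# Illustration with hypothetical numbers: suppose cell (51-55, z2) holds
# two individuals, the published average salary is 40000 EUR (table (a)),
# and the count is 2 (table (b)). An internal attacker who knows his own
# salary recovers the other person's salary exactly.
published_mean = 40000.0   # table (a), cell (51-55, z2)
count = 2                  # table (b), same cell
my_salary = 37500.0        # assumed, known only to the internal attacker

cell_total = published_mean * count
other_salary = cell_total - my_salary
print(other_salary)        # the other individual's salary: 42500.0
```

This is why cells with very few contributors are flagged as sensitive by the usual sensitivity rules.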
The above tables show a two-dimensional example, which can be considered
the simplest case. In practice, however, we must deal with more complex
situations, including multidimensional, hierarchical and linked tables.
Multidimensional tables are obtained by crossing more than two variables,
and they can be protected individually. Hierarchical tables are sets of
tables whose variables have a hierarchical relation (e.g., ZIP code and
city); in that case, the total or marginal cells of some tables are internal
cells of others, and the tables must be protected together to avoid the
disclosure of sensitive data. Finally, linked tables are a generalization
of the previous situation, where several tables are produced from the same
microdata and thus share information or cells, whether hierarchical or not.
Again, they must be protected together. Linked tables can have any dimension,
size and structure, and therefore subsume the other situations; dealing with
linked tables is a desired feature of any tabular protection method.
Eventually, the final goal would be to protect the whole set of linked
tables that can be produced from some microfiles (e.g., a population
census). Clearly, the number of cells involved in that case may reach
several million, an impractical size for most current tabular protection
techniques. The family of protection methods considered in this project
deals with linked tables, and can solve real-world large instances in a
few seconds.
All the above situations can refer either to frequency tables (i.e., cell
values are integer, usually the number of individuals in the cell) or to
magnitude tables (i.e., cell values are real and show, for instance, the
mean of some other variable over all the individuals in the cell). In this
project we mainly focus on magnitude tables; for frequency tables, the
optimization procedures to be developed can also be applied, followed by
some heuristic post-processing.
Current methods for tabular data protection can be classified as perturbative (they change the cell values) or nonperturbative (no change is performed). The most widely used nonperturbative method is cell suppression, where some secondary cells are removed to avoid the disclosure of the sensitive primary cells (which are removed as well). This results in a difficult combinatorial optimization problem: finding the pattern of secondary suppressions that makes the table safe with a minimum number of suppressed cells or minimum information loss. Some heuristics for two- and three-dimensional tables [22, 2, 15, 4, 5] and exact methods for two-dimensional and general linked tables [18, 19] have been suggested for the cell suppression problem. The main drawback of this approach is that, due to its combinatorial nature, solving very large instances (with possibly millions of cells) can require impractical execution times.
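Why secondary suppressions are needed at all can be shown with a minimal sketch (using the hypothetical 2x2 counts from table (b) above as internal cells): if only the sensitive cell is suppressed, an attacker reconstructs it exactly from the published marginals.

```python
# Sketch: suppressing a single sensitive cell is not enough, because an
# attacker recovers it exactly from the row/column totals. This is what
# forces the cell suppression problem to pick additional secondary cells.
table = [[20, 1],    # internal cells; table[0][1] is the sensitive one
         [30, 35]]
row_totals = [sum(r) for r in table]          # [21, 65], published
col_totals = [sum(c) for c in zip(*table)]    # [50, 36], published

# Suppress only the sensitive cell, then try to reconstruct it:
recovered = row_totals[0] - table[0][0]       # 21 - 20 = 1
print(recovered == table[0][1])               # True: exact disclosure
```

The optimization problem therefore chooses a minimum-cost set of secondary suppressions so that this kind of inference only yields a sufficiently wide interval around the sensitive value, not the value itself.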
Among the perturbative approaches, one of the techniques that has received most attention is rounding. This method rounds cell values to a multiple of a fixed integer rounding base. Controlled rounding is a variant in which the additivity of the table is preserved (i.e., rounded marginal values are the sum of the corresponding slice of internal rounded cells). Initially introduced in [1], efficient methods could only be developed for two-dimensional tables [9], possibly with subtotals [10]. For three-dimensional tables controlled rounding is an NP-hard problem [20]. Several heuristics [21] and exact approaches [23] were devised, but they were only applied to small tables. The NP-hardness of the approach makes it impractical for large tables, such as the real-world ones to be tested in this project. Moreover, in practice it can be necessary to keep some (possibly all) of the original total cells, instead of rounding them.
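The additivity requirement of controlled rounding can be illustrated on a single row (the data and the largest-remainder heuristic below are illustrative assumptions, not the methods cited above): each internal cell is rounded to a multiple of the base, and the number of round-ups is chosen so the rounded cells still add up to the rounded total.

```python
# Sketch of controlled rounding for one row of internal cells: round each
# cell to a multiple of `base`, then round up the cells with the largest
# remainders until the row adds up to the rounded total (additivity).
def controlled_round_row(cells, base):
    total = sum(cells)
    rounded_total = base * round(total / base)
    down = [base * (c // base) for c in cells]         # round all down
    ups_needed = (rounded_total - sum(down)) // base   # round-ups to restore the sum
    # Prefer rounding up the cells with the largest remainders:
    order = sorted(range(len(cells)), key=lambda i: cells[i] % base, reverse=True)
    out = list(down)
    for i in order[:ups_needed]:
        out[i] += base
    return out, rounded_total

cells, total = controlled_round_row([7, 12, 4, 9], base=5)
print(cells, total, sum(cells) == total)   # [5, 10, 5, 10] 30 True
```

Extending this consistency across every row, column and higher-dimensional slice simultaneously is what makes controlled rounding hard beyond two dimensions.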
To overcome these shortcomings of rounding, the controlled tabular adjustment (CTA) method was first introduced in [14]. Extensions of CTA have recently been considered for univariate and multivariate statistics [11]. Independently, a similar controlled perturbation method with a quadratic objective function was suggested in [3]; the resulting quadratic optimization problem is efficiently solved by an interior-point method in [6]. Both approaches find the minimum-distance (or closest) table to the one to be protected, preserving marginal values, if required, as well as any set of additional linear constraints. In other words, they attempt to minimize the information loss incurred when delivering the perturbed values.
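The core idea of CTA can be sketched on the 2x2 counts from table (b) above (the protection level is an assumed figure): the sensitive cell is pushed at least a protection level away from its true value, and the other cells absorb the change along a cycle so that every marginal is unchanged; CTA then seeks, by linear or quadratic optimization, the adjustment of this kind with minimum distance to the original table.

```python
# Minimal sketch of a CTA-style adjustment: move the sensitive cell by an
# assumed protection level `upl`, compensating along a 2x2 cycle so all
# row and column totals are preserved; measure the L1 information loss.
a = [[20, 1], [30, 35]]          # original internal cells
upl = 4                          # assumed upper protection level for a[0][1]

x = [row[:] for row in a]
x[0][1] += upl                   # push the sensitive cell upward
x[0][0] -= upl                   # compensate within the row
x[1][1] -= upl                   # compensate within the column
x[1][0] += upl                   # close the cycle

same_totals = ([sum(r) for r in x] == [sum(r) for r in a] and
               [sum(c) for c in zip(*x)] == [sum(c) for c in zip(*a)])
l1_distance = sum(abs(x[i][j] - a[i][j]) for i in range(2) for j in range(2))
print(same_totals, l1_distance)  # True 16
```

An actual CTA solver would search over all additivity-preserving adjustments for the one minimizing this distance, rather than using a fixed cycle as here.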
CTA and the quadratic minimum-distance controlled perturbation are essentially the same method. Both distances can be presented within a unified framework, which also includes a new variant based on the infinity norm. Some of the benefits of presenting these distances under a unified framework are: (1) they can be easily compared; (2) a single general result about the disclosure risk of the distances can be derived, independently of the objective function of the final formulation; (3) the different distances can be combined in a single objective function [7].
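Under notation introduced here only for illustration (a: vector of original cell values; x: perturbed values to be published; A x = b: additivity and any additional linear constraints; l, u: bounds on the published cells), the unified minimum-distance framework can be sketched as:

```latex
% Notation is illustrative, not taken from the cited works.
\min_{x}\; \|x - a\|_{\ell}
\qquad \text{subject to} \qquad
A x = b, \quad l \le x \le u,
\qquad \ell \in \{1,\, 2,\, \infty\},
```

together with, for every sensitive cell i with lower and upper protection levels lpl_i and upl_i, the disjunctive protection requirement x_i <= a_i - lpl_i or x_i >= a_i + upl_i. Choosing the norm ell selects between CTA (ell = 1), the quadratic variant (ell = 2), and the infinity-norm variant.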
The main features of CTA or minimum-distance controlled perturbation methods are:
Bibliography