INTRODUCTION
The safe dissemination of data is one of the main concerns
of National Statistical Agencies. The released data can be classified
as disaggregated or aggregated. Disaggregated data (a.k.a. microdata
or microfiles) consists of files of records, each record providing
the values for a set of variables of an individual. Aggregated data (a.k.a.
tabular data) is obtained from microdata crossing two or more variables,
which results in sets of tables with a likely large number of cells.
It must be guaranteed, for both types of data, that no individual information
can be derived from the released information. The available methods for
this purpose belong to the field of statistical disclosure control. Good
introductions to the state-of-the-art in this field can be found in the
monographs [25, 16].
In this project we focus first and mainly on tabular data protection. Although
each cell of a table shows aggregated information for several individuals,
there is still a risk of disclosing individual data. This is clearly shown
in the tables below. Table (a) gives the average salary by age interval
and ZIP code, while table (b) shows the number of individuals for the same
crossing of variables. If there were only one individual in ZIP code
z2 and age interval 51-55, then any external attacker would know that the
salary of this single person is 40000 EUR. If there were two individuals,
either of them could deduce the salary of the other, becoming an internal
attacker. Usually, cells showing information about few individuals are
considered sensitive, although other rules can be used in practice. Methods
for detecting sensitive cells are outside the scope of this project; a recent
discussion of sensitivity rules can be found in [17, 24].
         |  ...  |    z1     |    z2     |  ...
  -------+-------+-----------+-----------+------
   ...   |  ...  |    ...    |    ...    |  ...
   51-55 |  ...  | 38000 EUR | 40000 EUR |  ...
   56-60 |  ...  | 39000 EUR | 42000 EUR |  ...
   ...   |  ...  |    ...    |    ...    |  ...

         (a) Average salary

         |  ...  |  z1  |   z2   |  ...
  -------+-------+------+--------+------
   ...   |  ...  | ...  |  ...   |  ...
   51-55 |  ...  |  20  | 1 or 2 |  ...
   56-60 |  ...  |  30  |   35   |  ...
   ...   |  ...  | ...  |  ...   |  ...

         (b) Number of individuals
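The internal-attacker scenario described above can be made concrete with a short sketch (the salary of the attacker is a hypothetical figure, not from the tables): knowing the published mean and the cell count, one of the two individuals in the cell recovers the other's exact value.

```python
# Illustration with hypothetical numbers: suppose cell (51-55, z2) holds
# two individuals, the published average salary is 40000 EUR (table (a)),
# and the count is 2 (table (b)). An internal attacker who knows his own
# salary recovers the other person's salary exactly.
published_mean = 40000.0   # table (a), cell (51-55, z2)
count = 2                  # table (b), same cell
my_salary = 37500.0        # assumed, known only to the internal attacker

cell_total = published_mean * count
other_salary = cell_total - my_salary
print(other_salary)        # the other individual's salary: 42500.0
```

This is why cells with very few contributors are flagged as sensitive by the usual sensitivity rules.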
The above tables show a two-dimensional example, which can be considered
the simplest case. In practice, however, we must deal with more complex
situations, including multidimensional, hierarchical and linked tables.
Multidimensional tables are obtained by crossing more than two variables,
and they can be protected individually. Hierarchical tables are sets of
tables whose variables have a hierarchical relation (e.g., ZIP code and
city); in that case, the total or marginal cells of some tables are internal
cells of others, and the tables must be protected together to avoid the
disclosure of sensitive data. Finally, linked tables are a generalization
of the previous situation, where several tables are produced from the same
microdata and thus share information or cells, whether hierarchical or not.
Again, they must be protected together. Linked tables can have any dimension,
size and structure, and therefore subsume the other situations; dealing with
linked tables is a desired feature of any tabular protection method.
Eventually, the final goal would be to protect the whole set of linked
tables that can be produced from some microfiles (e.g., a population
census). Clearly, the number of cells involved in that case may reach
several million, an impractical size for most current tabular protection
techniques. The family of protection methods considered in this project
deals with linked tables, and can solve real-world large instances in a
few seconds.
All the above situations can refer either to frequency tables (i.e., cell
values are integer, usually the number of individuals in the cell) or to
magnitude tables (i.e., cell values are real and show, for instance, the
mean of some other variable over all the individuals in the cell). In this
project we mainly focus on magnitude tables; for frequency tables, the
optimization procedures to be developed can also be applied, followed by
some heuristic post-processing.
Current methods for tabular data protection can be classified as perturbative (they change the cell values) or nonperturbative (no change is performed). The most widely used nonperturbative method is cell suppression, where some secondary cells are removed to avoid the disclosure of the sensitive primary cells (which are removed as well). This results in a difficult combinatorial optimization problem: finding the pattern of secondary suppressions that makes the table safe with a minimum number of suppressed cells or minimum information loss. Some heuristics for two- and three-dimensional tables [22, 2, 15, 4, 5] and exact methods for two-dimensional and general linked tables [18, 19] have been suggested for the cell suppression problem. The main drawback of this approach is that, due to its combinatorial nature, solving very large instances (with possibly millions of cells) can require impractical execution times.
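Why secondary suppressions are needed at all can be shown with a minimal sketch (using the hypothetical 2x2 counts from table (b) above as internal cells): if only the sensitive cell is suppressed, an attacker reconstructs it exactly from the published marginals.

```python
# Sketch: suppressing a single sensitive cell is not enough, because an
# attacker recovers it exactly from the row/column totals. This is what
# forces the cell suppression problem to pick additional secondary cells.
table = [[20, 1],    # internal cells; table[0][1] is the sensitive one
         [30, 35]]
row_totals = [sum(r) for r in table]          # [21, 65], published
col_totals = [sum(c) for c in zip(*table)]    # [50, 36], published

# Suppress only the sensitive cell, then try to reconstruct it:
recovered = row_totals[0] - table[0][0]       # 21 - 20 = 1
print(recovered == table[0][1])               # True: exact disclosure
```

The optimization problem therefore chooses a minimum-cost set of secondary suppressions so that this kind of inference only yields a sufficiently wide interval around the sensitive value, not the value itself.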
Among the perturbative approaches, one of the techniques that has received most attention is rounding. This method rounds cell values to a multiple of a fixed integer rounding base. Controlled rounding is a variant in which the additivity of the table is preserved (i.e., rounded marginal values are the sum of the corresponding slice of internal rounded cells). Initially introduced in [1], efficient methods could only be developed for two-dimensional tables [9], possibly with subtotals [10]. For three-dimensional tables controlled rounding is an NP-hard problem [20]. Several heuristics [21] and exact approaches [23] were devised, but they were only applied to small tables. The NP-hardness of the approach makes it impractical for large tables, such as the real-world ones to be tested in this project. Moreover, in practice it can be necessary to keep some (possibly all) of the original total cells, instead of rounding them.
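The additivity requirement of controlled rounding can be illustrated on a single row (the data and the largest-remainder heuristic below are illustrative assumptions, not the methods cited above): each internal cell is rounded to a multiple of the base, and the number of round-ups is chosen so the rounded cells still add up to the rounded total.

```python
# Sketch of controlled rounding for one row of internal cells: round each
# cell to a multiple of `base`, then round up the cells with the largest
# remainders until the row adds up to the rounded total (additivity).
def controlled_round_row(cells, base):
    total = sum(cells)
    rounded_total = base * round(total / base)
    down = [base * (c // base) for c in cells]         # round all down
    ups_needed = (rounded_total - sum(down)) // base   # round-ups to restore the sum
    # Prefer rounding up the cells with the largest remainders:
    order = sorted(range(len(cells)), key=lambda i: cells[i] % base, reverse=True)
    out = list(down)
    for i in order[:ups_needed]:
        out[i] += base
    return out, rounded_total

cells, total = controlled_round_row([7, 12, 4, 9], base=5)
print(cells, total, sum(cells) == total)   # [5, 10, 5, 10] 30 True
```

Extending this consistency across every row, column and higher-dimensional slice simultaneously is what makes controlled rounding hard beyond two dimensions.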
To overcome these shortcomings of rounding, the controlled tabular adjustment (CTA) method was first introduced in [14]. Extensions of CTA have recently been considered for univariate and multivariate statistics [11]. Independently, a similar controlled perturbation method with a quadratic objective function was suggested in [3]; the resulting quadratic optimization problem is efficiently solved by an interior-point method in [6]. Both approaches find the minimum-distance (or closest) table to the one to be protected, preserving marginal values, if required, as well as any set of additional linear constraints. In other words, they attempt to minimize the information loss incurred when delivering the perturbed values.
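The core idea of CTA can be sketched on the 2x2 counts from table (b) above (the protection level is an assumed figure): the sensitive cell is pushed at least a protection level away from its true value, and the other cells absorb the change along a cycle so that every marginal is unchanged; CTA then seeks, by linear or quadratic optimization, the adjustment of this kind with minimum distance to the original table.

```python
# Minimal sketch of a CTA-style adjustment: move the sensitive cell by an
# assumed protection level `upl`, compensating along a 2x2 cycle so all
# row and column totals are preserved; measure the L1 information loss.
a = [[20, 1], [30, 35]]          # original internal cells
upl = 4                          # assumed upper protection level for a[0][1]

x = [row[:] for row in a]
x[0][1] += upl                   # push the sensitive cell upward
x[0][0] -= upl                   # compensate within the row
x[1][1] -= upl                   # compensate within the column
x[1][0] += upl                   # close the cycle

same_totals = ([sum(r) for r in x] == [sum(r) for r in a] and
               [sum(c) for c in zip(*x)] == [sum(c) for c in zip(*a)])
l1_distance = sum(abs(x[i][j] - a[i][j]) for i in range(2) for j in range(2))
print(same_totals, l1_distance)  # True 16
```

An actual CTA solver would search over all additivity-preserving adjustments for the one minimizing this distance, rather than using a fixed cycle as here.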
CTA and the quadratic minimum-distance controlled perturbation are essentially the same method. Both distances can be presented within a unified framework, which also includes a new variant based on the infinity norm. Some of the benefits of presenting these distances under a unified framework are: (1) they can be easily compared; (2) a single general result about the disclosure risk of the distances can be derived, independently of the objective function of the final formulation; (3) the different distances can be combined in a single objective function [7].
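Under notation introduced here only for illustration (a: vector of original cell values; x: perturbed values to be published; A x = b: additivity and any additional linear constraints; l, u: bounds on the published cells), the unified minimum-distance framework can be sketched as:

```latex
% Notation is illustrative, not taken from the cited works.
\min_{x}\; \|x - a\|_{\ell}
\qquad \text{subject to} \qquad
A x = b, \quad l \le x \le u,
\qquad \ell \in \{1,\, 2,\, \infty\},
```

together with, for every sensitive cell i with lower and upper protection levels lpl_i and upl_i, the disjunctive protection requirement x_i <= a_i - lpl_i or x_i >= a_i + upl_i. Choosing the norm ell selects between CTA (ell = 1), the quadratic variant (ell = 2), and the infinity-norm variant.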
The main features of CTA or minimum-distance controlled perturbation methods are:
Bibliography