Very large scale optimization for data privacy

Research project MTM2009-08747, Spanish Ministry of Science and Innovation. January 1 2010-December 31 2012.


Introduction and objectives

National Statistical Agencies are forced by law to control the disclosure of individual confidential information from published data. Prior to dissemination, data must be processed by some protection method. This project is about tabular data protection methods, i.e., methods that deal with the trade-off between utility and disclosure risk of aggregated data. One of these methods, minimum distance Controlled Tabular Adjustment (CTA), suggested by the research group, has shown to be more efficient than alternative tools. This technique formulates a linear (for L1 distance) or quadratic (for L2) optimization problem. They become mixed integer linear or quadratic problems if binary decisions about sensitive cells are included in the model.

Previous results showed that polynomial-time interior-point methods are very effective for the solution of the resulting linear or quadratic optimization problems. For some structured tabular data, specialized methods developed by the group outperformed state-of-the-art commercial solvers by a factor of 500. However, for general unstructured and massive data (resulting in optimization problems of several millions of variables and few millions of constraints), general interior-point solvers based of Cholesky factorizations may show a poor performance. The above only applies to a single table. If one wants to jointly protect the full set of tables from, e.g., a particular census, (which is not done, but it would be the safest approach), the resulting problem is in the limits (if not beyond) of capabilities of current optimization technology. One of the project goals is to fill this gap, developing and testing interior-point algorithms based on iterative solvers for this massive optimization problems.

In addition, in a recently finished project we participated, the mixed integer version of CTA was used by Eurostat (Statistical Agency of the European Commission) within a wider scheme for the protection of structural bussiness statistics of European companies. Though the dimension of instances was not very large, the inclusion of binary variables in the model exhausted branch-and-cut of current state-of-the-art optimization solvers. The project will also consider the solution of these discrete optimization problems by Benders reformulation, Lagrangian relaxation, and specific cuts generation (e.g., perspective cuts for quadratic objectives).

An alternative to mixed integer CTA could be the interval protection method. Though this approach formulates a continuous optimization problem, the number of variables is of order square of cells in the table, and the number of constraints is of order number of cells times number of table relations. For a moderate table it may result in optimization problems of about 1e+8 variables and 1e+7 constraints. The constraints matrix exhibits a dual block-angular structure. We will approach it by means of interior-point methods for structured problems.

The main objectives of the project are thus: 1) Development of interior-point algorithms based on iterative solvers for very large general unstructured problems. 2) Development of interior-point algorithms for massive dual block-angular problems. 3) Development and application of improving strategies for the above approaches: preconditioners, regularizations. 4) Solution of the mixed integer CTA problem by procedures like Benders reformulation, Lagrangian relaxation, perspective cuts, coordinate descent.


Click here  for a list of papers and conference presentations generated within the project.