Foiling the Data Snoopers
Setting Proper Boundaries on Shared Information
In the past decade many businesses and organizations that compile large amounts of data have used a “data cube” format for managing it. Individual pieces of information are stored as separate cells within the larger cube, and aggregated views of the cube can be made available after screening out the cells that contain sensitive information.
The format has great practical utility, allowing, for example, the U.S. Census Bureau to aggregate massive amounts of data for public review without compromising personal information, or allowing a company in a joint venture with another firm to share necessary information without revealing proprietary data.
Underlying the data cube format is a critical security question: What are the boundaries of information that can be released without allowing a data snooper to get at the blocked-out sensitive information by drawing inferences from the data made available? Haibing Lu, who joined the Leavey School of Business in fall quarter 2011 as an assistant professor of operations management and information systems (OMIS), co-authored one of the pioneering papers in that area of inquiry with Yingjiu Li of Singapore. Their study, “Practical Inference Control for Data Cubes,” was published by the IEEE Computer Society.
Those formatting the data cube must think like the bad guys.
“It’s a very challenging problem,” Lu says, “and before our paper, the privacy issue with data cubes had not really been studied in depth.”
As a hypothetical example of how the problem works, Lu posits a health survey of a large employer. If the released data showed that five people in a firm of several thousand had been treated for an ailment, their privacy most likely wouldn’t be compromised. If it showed that five people in a smaller department within the organization had been treated, someone with knowledge of that department might be able to figure out who they were. The question, then, is where between those two points the boundary should be set for the data that will be made available.
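One common safeguard suggested by this example is a minimum-count threshold: counts for groups smaller than some cutoff are suppressed before release. The sketch below is a hypothetical illustration of that rule of thumb, not the method from Lu and Li’s paper; the department names, counts, and the threshold value are all invented for the example.

```python
# Hypothetical cell-suppression sketch: publish a per-department count of
# treated employees only when the count meets a minimum threshold k.
from collections import Counter

def suppress_small_cells(records, k=5):
    """Count treated employees per department; replace counts below k with None."""
    counts = Counter(dept for dept, treated in records if treated)
    return {dept: (n if n >= k else None) for dept, n in counts.items()}

# Invented data: a 1,000-person engineering division and a 15-person
# accounting department.
records = (
    [("Engineering", True)] * 40 + [("Engineering", False)] * 960
    + [("Accounting", True)] * 3 + [("Accounting", False)] * 12
)

released = suppress_small_cells(records, k=5)
# Engineering's count (40) is safe to publish; Accounting's count (3) is
# suppressed, since someone who knows the 15-person department might be
# able to guess who was treated.
```

A threshold alone is not enough in practice, which is Lu’s point: a snooper can sometimes recover a suppressed small cell by subtracting released aggregates from one another, so the boundary has to be drawn with those inferences in mind.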
Drawing that boundary line in the right place involves a two-step process. The first step is to define and clearly understand the standard of privacy that has to be maintained. The second is for people formatting the data cube to think like the bad guys. “You want to attack your own system,” Lu says. “If I’m doing it, I want to see if I can break this system, and what is it that enables me to attack the system.”
Before Lu and Li published their paper in the journal of the IEEE Computer Society, most people working with data cubes had used a set of standards known as the Fréchet bounds to determine upper and lower limits on each cell value within the larger data cube, taking into account the aggregations of cell values over multiple dimensions within the cube.
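For a simple two-dimensional table, the classical Fréchet bounds have a closed form: a hidden cell can be no larger than the smaller of its row and column totals, and no smaller than what those totals force once the grand total is accounted for. The sketch below illustrates that standard two-way case with invented numbers; it is not the multi-dimensional formula from Lu and Li’s paper.

```python
def frechet_bounds(row_sum, col_sum, total):
    """Classical Fréchet bounds on one cell of a two-way count table,
    given its row marginal, column marginal, and grand total."""
    lower = max(0, row_sum + col_sum - total)
    upper = min(row_sum, col_sum)
    return lower, upper

# Only the marginals of a 15-person table are released; the snooper bounds
# the hidden cell (e.g. "treated employees in one department") from them.
lo, hi = frechet_bounds(row_sum=3, col_sum=4, total=15)
# The cell lies somewhere in [0, 3]: little is disclosed.
lo2, hi2 = frechet_bounds(row_sum=3, col_sum=14, total=15)
# Now the cell lies in [2, 3]: the marginals nearly pin it down.
```

When the lower and upper bounds coincide, the "protected" cell is exactly disclosed by the released aggregates, which is precisely the kind of inference a data publisher must check for.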
In their paper Lu and Li propose an improvement on those bounds, based on a complex formula. Their approach produces bounds at least as tight as the Fréchet bounds, but at a lower computational cost.
With the dramatic increase in computer power and data availability through the Internet, this will be an ongoing field of research with substantive practical implications, particularly for larger firms and organizations, Lu says.
“AOL had a serious problem a while back when it posted information about user queries that allowed snoopers to identify some users. It’s an issue that large government organizations have to deal with. And companies like Google and Yahoo are taking a hard look at these issues because they know that if there’s a serious privacy breach, there will be a big lawsuit.”