In the digital age, data is ubiquitous. The increasing ease with which every online interaction can be stored, compared and analyzed has transformed a wide swath of business and led to the formation of new ones. The same principles apply to the scientific world, where fishing correlations from the sea of millions or billions of data points promises to reveal new insights about the genetics of disease or the properties of never-before-seen molecules.
While a future where algorithms allow taxis to know where riders will be, online stores to know what new items shoppers will want, and medical organizations to know where an outbreak will next strike sounds ideal, the other side of this coin is that not everyone wants the world to know what these algorithms might reveal.
The stakes are even higher in the context of national security. Network analysis techniques that comb through an individual’s social connections are particularly suited to the threats of the 21st century, but are also particularly pernicious when it comes to issues of surveillance and privacy.
Researchers at Penn’s Warren Center for Network and Data Sciences are exploring how the algorithms that conduct this kind of analysis can be designed to guarantee certain privacy protections.
While agencies that perform such analyses would still have access to private data, and would need to voluntarily agree to adopt such algorithms, doing so would provide mathematical certainty that information about people accidentally caught up in such searches could not be revealed.
Michael Kearns, co-director of the Warren Center and National Center Professor of Management & Technology in the School of Engineering and Applied Science, fellow CIS Assistant Professor Aaron Roth, their student Steven Wu, and Warren Center postdoc Grigory Yaroslavtsev, recently published a study in the Proceedings of the National Academy of Sciences that outlines the new algorithms.
The impetus for this project came from a National Research Council study on which Kearns served. Tasked with exploring technological alternatives to bulk data collection in the context of national security, the committee’s conclusion was that no technology could fully replace bulk collection with a more targeted approach.
Seeing the possibility of a middle ground, Kearns and his fellow Warren Center researchers set off to find a mathematical way of balancing the needs of the counterintelligence community and public fears of mass surveillance.
“We thought we could inject some science into such discussions by allowing trade-offs between privacy and utility,” Kearns says.
“In other words,” Roth adds, “how can we have as little intrusion as possible while still being able to catch terrorists?”
In brainstorming how to go about this, the researchers began investigating whether they could apply “differentially private” algorithms to graph search, a type of analysis that crawls through a network of phone calls between suspected terrorists to look for accomplices, or the network of interactions between people with an infectious disease to predict where an outbreak might occur.
Roth, along with colleagues from academia and the computer industry, has been fleshing out the concepts behind differential privacy for several years. The "differential" in its name refers to the guarantee a differentially private algorithm makes: its analyses should remain statistically nearly identical when applied to two different datasets, one with and one without the data from any single individual.
“The math behind differentially private algorithms gives you a provable guarantee that, even if you know everything else about the dataset, you can’t learn anything new about my data in particular,” Roth says. “The algorithm essentially can’t tell whether my data is in the data set in the first place.”
Applying a differentially private algorithm would prevent idiosyncratic information from being reverse-engineered from the dataset. In the simplest case, this kind of reverse engineering would be like deducing someone's salary by looking at how their company's average salary changed after they were hired. Differentially private algorithms add noise such that the aggregate analysis, the average salary, remains statistically accurate, while obscuring individual data enough that the analysis can't be used to identify the contribution of any single person.
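A minimal sketch of that noise-adding idea, using the standard Laplace mechanism on the salary example. This illustrates differential privacy in general, not the graph-search algorithms from the paper, and the function names and clipping bounds are assumptions for the sake of the example:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sample from a zero-mean Laplace distribution with the given scale.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_average(salaries, low, high, epsilon):
    """Differentially private mean of salaries clipped to [low, high].

    After clipping, changing any one record moves the mean by at most
    (high - low) / n, so Laplace noise with scale sensitivity / epsilon
    yields an epsilon-differentially-private answer.
    """
    n = len(salaries)
    clipped = [min(max(s, low), high) for s in salaries]
    true_avg = sum(clipped) / n
    sensitivity = (high - low) / n  # worst-case effect of one person's data
    return true_avg + laplace_noise(sensitivity / epsilon)
```

The released average stays close to the true one, but an observer who compares the published figure before and after a hire can no longer pin down the newcomer's salary, because the noise swamps any single record's influence.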
When it comes to network analysis, the key data is where an individual is located in the network and which other individuals they are linked to. A typical graph search technique, known as "contact chaining," might start at the node of an individual known to be in a group that is the target of an investigation, and then crawl through the connected nodes to see whether they are part of the targeted group as well. The search might also go further, probing through the nodes two or more connections removed from the starting point.
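The contact-chaining search described above amounts to a breadth-first traversal bounded by a hop count. A rough sketch, assuming a plain adjacency-list representation of the network (the data structure and function name are hypothetical, not drawn from the study):

```python
from collections import deque

def contact_chain(graph, seed, max_hops):
    """Breadth-first contact chaining: collect every node within
    max_hops connections of a seed node.

    graph: dict mapping each node to a set of directly connected nodes.
    Returns a dict mapping each reached node to its hop distance.
    """
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen
```

With `max_hops=1` this returns only a suspect's direct contacts; raising it to 2 or 3 pulls in the wider circles the article describes, which is exactly where untargeted bystanders start to get swept up.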
Privacy issues begin to arise when individuals who aren’t being targeted find themselves, through no fault of their own, sandwiched between individuals who are. Arresting the targeted people on both sides could reveal information about the person in the middle if they are the only link that connects them. Performing such searches using a differentially private algorithm would necessarily obscure and protect those people.
As an example, consider Alice, a doctor who specializes in a rare disease, and Betty, a patient who lives in the same small town. Betty also has a doctor in the family, Charlie, and so has legitimate reasons for being in touch with both Charlie and Alice. Authorities, however, are investigating Charlie's practice for filing fraudulent medical procedure claims with insurance companies, and, by crawling through Betty's node, discover that members of Alice's practice may separately be committing the same kind of fraud. Arresting both Alice and Charlie might inadvertently reveal that Betty has the rare disease, since she was the only link between the two groups, and it is generally known where she lives and that she is related to Charlie.
“It’s crucial to note that Betty is innocent of any bad activity here,” Kearns says. “She was just unfortunate enough to be the only one to have legitimate connections to bad actors in the two groups. If instead there were many innocent ‘bridges’ between the groups, it would be OK to arrest both, because each bridge has plausible deniability. Our paper is making this idea precise and quantifying it algorithmically.”
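The "bridges" idea can be made concrete by counting the innocent nodes that directly connect the two target groups; when that count is one, acting against both groups exposes the lone bridge. This is only an illustrative sketch with hypothetical names, not the quantification developed in the paper:

```python
def bridge_nodes(graph, group_a, group_b):
    """Return the nodes outside both target groups that directly
    connect a member of group_a to a member of group_b.

    graph: dict mapping each node to a set of directly connected nodes.
    A single bridge (like Betty) is exposed by actions against both
    groups; many bridges leave each one with plausible deniability.
    """
    bridges = set()
    for node, neighbors in graph.items():
        if node in group_a or node in group_b:
            continue  # targeted individuals are not bridges
        if neighbors & group_a and neighbors & group_b:
            bridges.add(node)
    return bridges
```

In the article's scenario the set contains only Betty, so simultaneous arrests would implicate her; a differentially private search would instead behave nearly the same whether or not her link was in the data.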
While this approach can’t address thorny questions about who had access to which individual’s data, or how to determine which traits qualify someone for inclusion in the targeted group, the Penn researchers see this as an important step for establishing safeguards in these increasingly common practices.
It’s also the kind of scientific question the Warren Center was created to answer.
“This kind of challenge is really the best case scenario for the Warren Center,” Kearns says. “We’re bringing people together to work on something that’s a network problem, a data problem, and a problem of real societal concern, all at the same time.”