No such thing as anonymous data
1st October, 2019
6 min read
New York City publicly released anonymised data on almost two million taxi rides - including fares and tips - to help independent researchers study transport, traffic, and civic planning. However, the city made a mistake: it only replaced each taxi medallion number with the MD5 hash of that number. MD5 is fast to compute and the set of valid medallion numbers is small, so a privacy researcher, Vijay Pandurangan, was able to hash every possible number, match the results against the dataset, and re-identify every taxi driver in it, exposing their home addresses and incomes.
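To see why hashing a small set of identifiers offers so little protection, here is a minimal sketch of a lookup-table attack. The medallion format used (digit, letter, digit, digit) is an assumption for illustration only, and the names `md5_hex`, `candidate_medallions`, and `lookup` are hypothetical; this is not a reconstruction of the researcher's actual code.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the hexadecimal MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def candidate_medallions():
    """Yield every identifier in an ASSUMED format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


# Hash every possible identifier once and index the results by hash.
lookup = {md5_hex(m): m for m in candidate_medallions()}

# Any "anonymised" value in the released data can now be reversed instantly.
masked = md5_hex("5X55")
print(len(lookup))      # 26000 candidates in this assumed format
print(lookup[masked])   # 5X55
```

The weakness shown here is the size of the input space rather than anything specific to MD5: a stronger hash function applied the same way would be reversed by the same table.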
Not only that, other researchers showed that the addresses of celebrities could be exposed by cross-referencing images of them entering or leaving taxis in paparazzi pictures.
Although the immediate cause of this disastrous privacy breach was a weak hashing scheme used to mask the highly sensitive taxi medallion number, the problem runs deeper. As Cynthia Dwork says:
"de-identified data isn't"
By this she means that either the data is not de-identified (anonymous), or it is not data!
Even when efforts are made to anonymise a dataset - masking, generalising, or even deleting directly and indirectly identifying information - privacy can still be jeopardised. Even if everything is done perfectly, with no mistakes, privacy is never guaranteed. Once a dataset is public, it is open to privacy attacks including differencing, re-identification, record linkage, and reconstruction, each of which I'll briefly explain below.
Any dataset is also exposed to unknown or future attacks. A dataset might have been processed using rules that should guarantee privacy, but that guarantee may fail immediately or in the near future because of increases in computing power, new techniques, or new datasets from unrelated organisations that can be used to reveal identities.
If you have some knowledge about an individual, you can use that to examine multiple statistics in which the individual's data is included. In short, a differencing attack can occur where multiple questions on their own do not risk re-identifying an individual, but when the answers to two or more of them are combined they can expose that individual's identity or data.
For instance, if we know an individual is male and vegetarian, and he attends a dinner where aggregate data about the attendees is released, we can directly identify him if he is the only man to order a vegetarian meal. This may seem harmless, but the full data record from the event for that individual could be leveraged in a separate attack on different datasets.
This example is easily noticed and understood by humans. Algorithms, however, can quickly process large datasets, combining multiple datapoints and easily exposing an individual's details. That is a simple task for an algorithm, but the equivalent combinations are impossible for a human to notice and avoid before a dataset is released.
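The dinner example can be sketched in a few lines. The attendee records and the `total_spend` helper below are invented for illustration; the point is only that two aggregate answers, each harmless alone, differ by exactly one person's value.

```python
# Hypothetical attendee records; only aggregates over them are ever published.
attendees = [
    {"name": "Alice", "sex": "F", "meal": "vegetarian", "spend": 30},
    {"name": "Bob",   "sex": "M", "meal": "meat",       "spend": 45},
    {"name": "Carol", "sex": "F", "meal": "vegetarian", "spend": 25},
    {"name": "Dan",   "sex": "M", "meal": "vegetarian", "spend": 38},
    {"name": "Eve",   "sex": "F", "meal": "meat",       "spend": 50},
]


def total_spend(rows, predicate):
    """An aggregate query: total spend of the rows matching a predicate."""
    return sum(row["spend"] for row in rows if predicate(row))


# Query 1: total spend of all vegetarians.
all_vegetarians = total_spend(attendees, lambda r: r["meal"] == "vegetarian")

# Query 2: total spend of female vegetarians.
female_vegetarians = total_spend(
    attendees, lambda r: r["meal"] == "vegetarian" and r["sex"] == "F"
)

# The attacker knows there is exactly one male vegetarian, so the
# difference between the two aggregates is his individual spend.
leaked = all_vegetarians - female_vegetarians
print(leaked)  # 38
```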
If an attacker knows even a single piece of information about an individual, they can use that to identify an individual's information in a publicly available, anonymised dataset.
In 2006 Netflix published its users' movie-rating information. The data had been anonymised - de-identified - by replacing the users' names with random numbers, and changing their personal details. Arvind Narayanan and Prof. Vitaly Shmatikov managed to re-identify individuals in the dataset using publicly available IMDb movie ratings. In fact, they found that knowing two of a person's movie ratings and the approximate dates they were made was enough to have a nearly 70% chance of re-identifying that person in the so-called anonymised dataset.
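A toy version of this kind of matching is sketched below. The dataset, the user IDs, the three-day tolerance, and the `matches` and `candidates` helpers are all assumptions made for illustration; the published attack used a more careful statistical scoring method.

```python
from datetime import date

# Hypothetical "anonymised" release: random user ID -> {movie: rating date}.
anonymised = {
    1001: {"Movie A": date(2005, 3, 1), "Movie B": date(2005, 3, 9),
           "Movie C": date(2005, 4, 2)},
    1002: {"Movie A": date(2005, 3, 20), "Movie C": date(2005, 4, 1)},
    1003: {"Movie B": date(2005, 3, 9), "Movie C": date(2005, 4, 2)},
}


def matches(record, known, tolerance_days=3):
    """True if every known (movie, date) pair appears in the record,
    allowing the dates to be off by up to tolerance_days."""
    return all(
        movie in record
        and abs((record[movie] - when).days) <= tolerance_days
        for movie, when in known.items()
    )


def candidates(dataset, known, tolerance_days=3):
    """Return the IDs of all records consistent with the attacker's knowledge."""
    return [uid for uid, record in dataset.items()
            if matches(record, known, tolerance_days)]


# The attacker saw two public reviews by the target, with approximate dates.
known = {"Movie A": date(2005, 3, 2), "Movie C": date(2005, 4, 1)}
print(candidates(anonymised, known))  # [1001]
```

When only one candidate remains, every other rating in that record is now tied to the person the attacker already knew.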
Record linkage happens when an attacker can connect anonymised, often unrelated, datasets in order to reveal the identity of an individual. This is possible when datasets contain the same information about an individual, called indirect identifiers, or quasi-identifiers.
Quasi-identifiers do not directly identify an individual, but when combined with other quasi-identifiers will identify that individual.
A famous example of a linkage attack occurred in Massachusetts in 1997. There was some controversy over patient information released publicly, despite assurances from the Governor, William Weld, that all direct identifiers had been deleted. Famously, Latanya Sweeney took this publicly available and supposedly anonymised data and compared it with another publicly available dataset - the voter register, which she obtained for $20. With these two sets of data she was able to find the Governor's personal medical records. The attack succeeded because the patients' zip codes, dates of birth, and sex were left untouched in the dataset of hospital attendance.
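A linkage attack of this shape is essentially a join on the quasi-identifiers. The sketch below uses entirely made-up records and a hypothetical `link` helper; it assumes both datasets store zip code, date of birth, and sex in the same format.

```python
QUASI_IDENTIFIERS = ("zip", "dob", "sex")

# Hypothetical "anonymised" medical release: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02139", "dob": "1961-07-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1975-02-03", "sex": "M", "diagnosis": "fracture"},
    {"zip": "02140", "dob": "1980-11-30", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public voter register: names present.
voters = [
    {"name": "Voter A", "zip": "02139", "dob": "1961-07-14", "sex": "F"},
    {"name": "Voter B", "zip": "02139", "dob": "1975-02-03", "sex": "M"},
    {"name": "Voter C", "zip": "02139", "dob": "1975-02-03", "sex": "M"},
    {"name": "Voter D", "zip": "02141", "dob": "1980-11-30", "sex": "F"},
]


def key(row):
    """The combination of quasi-identifiers used to join the two datasets."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs where exactly one voter shares the key."""
    voters_by_key = {}
    for voter in voter_rows:
        voters_by_key.setdefault(key(voter), []).append(voter)

    linked = []
    for record in medical_rows:
        matching = voters_by_key.get(key(record), [])
        if len(matching) == 1:  # a unique match re-identifies the patient
            linked.append((matching[0]["name"], record["diagnosis"]))
    return linked


print(link(medical, voters))  # [('Voter A', 'asthma')]
```

The second medical record stays ambiguous because two voters share its key, which is the intuition behind generalising quasi-identifiers until every combination covers several people.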
This last attack, reconstruction, is probably the most dangerous because it works not by using the dataset directly, but by relying on statistics derived from it. Researchers often release statistical aggregations from their datasets - the percentage of people who are married, single, or divorced, or breakdowns by income band, gender, etc. These seem innocuous, but every piece of aggregated data narrows the possible set of records that could have contributed to that statistic.
Simson Garfinkel - Senior Scientist on the U.S. Census Bureau team for disclosure avoidance - showed that it is possible to reconstruct the personal data of individuals from a summary of the mean and median age, and frequency count broken down by some demographics - gender, income, marital status.
Even if the source dataset is never exposed to the public, the more statistics that are released, the more individuals' details will be exposed.
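A tiny brute-force sketch shows how published statistics can pin down the underlying records. The figures in `published`, the age range, and the `reconstruct` helper are invented for illustration; real reconstruction work on census-scale tables uses constraint solvers rather than exhaustive search.

```python
from itertools import combinations_with_replacement

AGES = range(0, 116)  # assumed range of plausible ages

# Hypothetical published statistics for a group of three people.
published = {
    "count": 3,
    "mean_age": 44,
    "median_age": 30,
    "male_count": 2,
    "male_mean_age": 27,
}


def consistent(males, females, stats):
    """Check a candidate set of ages against the whole-group statistics.
    The median check assumes an odd number of people."""
    everyone = sorted(males + females)
    return (
        sum(everyone) == stats["mean_age"] * stats["count"]
        and everyone[len(everyone) // 2] == stats["median_age"]
    )


def reconstruct(stats):
    """Enumerate every (male ages, female ages) combination that fits."""
    n_male = stats["male_count"]
    n_female = stats["count"] - n_male
    solutions = []
    for males in combinations_with_replacement(AGES, n_male):
        # Discard candidates that contradict the male-only statistic early.
        if sum(males) != stats["male_mean_age"] * n_male:
            continue
        for females in combinations_with_replacement(AGES, n_female):
            if consistent(males, females, stats):
                solutions.append((males, females))
    return solutions


print(reconstruct(published))  # [((24, 30), (78,))]
```

With only the mean and median the ages are ambiguous, but adding one breakdown by sex leaves a single possibility, so the "aggregate" release discloses every individual's age.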
Is honesty the end of privacy?
To gain value from gathered data, the answers must be accurate. However, as Cynthia Dwork and Aaron Roth point out in "The Algorithmic Foundations of Differential Privacy":
"overly accurate answers to too many questions will destroy privacy in a spectacular way"
What is the answer then? If everyone lies, all data gathered is useless, which is bad news for you and your doctor. If you do let him know it hurts when you cough then you'll get proper treatment, but how do you know your personal data is safe? What if sharing statistics that your data contributes to helps others? There is a choice to be made between never sharing your data and allowing it to be shared in a safe way, with an informed choice about how much privacy risk you'll face for sharing different levels of detail.
Differential Privacy offers a solution to the problem of balancing privacy and generating useful statistical output from datasets - something I'll explain in another article.