Public consultation on the Privacy Act – Submission – Aleksandar Nikolov and Nicolas Papernot
Aleksandar Nikolov, Computer Science, University of Toronto
Canada Research Chair in Algorithms and Private Data Analysis
Faculty Fellow, Schwartz Reisman Institute for Technology and Society
Faculty Affiliate, Vector Institute for Artificial Intelligence
Nicolas Papernot, Electrical Engineering and Computer Science, University of Toronto
Faculty Affiliate, Schwartz-Reisman Institute for Technology and Society
Member and Canada CIFAR AI Chair, Vector Institute for Artificial Intelligence
As residents of Canada and researchers whose work centres on issues of privacy in machine learning and data analysis, we welcome the opportunity to participate in the Government of Canada’s public consultation about the future of the Privacy Act. We are faculty members in the University of Toronto’s Department of Computer Science, and are affiliated with the Schwartz Reisman Institute for Technology and Society and the Vector Institute for Artificial Intelligence. We have both worked extensively in the area of privacy-preserving data analysis, and our submission draws on research in this field.
In this submission we focus on the issue of de-identifying personal information, in response to Section 7 of the discussion paper. The discussion paper defines de-identified information as “personal information that has been modified so that it can no longer be attributed to a specific individual without the use of additional information.” We argue that this definition is inadequate because it focuses on modifying personal information rather than on the method used to analyze this information. To support this argument, we give examples of re-identification and reconstruction attacks against de-identified data. The attacks are powerful, robust, computationally efficient, and apply widely against data-centric approaches to de-identification. Their existence contradicts the discussion paper’s claim that re-identification of de-identified data is merely anecdotal, and illustrates that the proposed “framework focussed on reducing risks by removing personal identifiers” is insufficient to provide meaningful privacy protection for personal information. We conclude by giving a high-level, non-technical overview of recent frameworks of privacy protection developed by researchers in computer science, AI, and statistics, which, unlike the approach proposed in the discussion paper, are centred on algorithms rather than on the data themselves. Finally, we note that these frameworks reason about and limit possible privacy leaks beyond mere re-identification. For instance, one may wish to limit the risk of recovering missing attributes from data records partially known to an adversary. As another example, it may be undesirable to leak the very presence of a data point in an analysis, whether the data point is de-identified or not, because the mere fact that the data point was included in the analysis can be privacy-sensitive (e.g., inclusion in an analysis conducted as part of a healthcare application may suffice to indicate that an individual has a particular medical condition).
Attacks against (possibly de-identified) data analyses.
The discussion paper proposes managing privacy risks by “removing personal identifiers.” There are many examples showing that this approach usually fails, because seemingly non-identifying pieces of information can often be linked together to uniquely identify people. This observation goes back to pioneering work by Latanya Sweeney, who famously identified former Massachusetts Governor William Weld’s health records from his date of birth, gender, and zip code [S00]. This type of attack is known as a “reconstruction attack” (when the attacker reconstructs a data set containing personal information) or a “re-identification attack” (when the attacker can also link the personal information to identifiable individuals). To illustrate how such attacks work, we will take a lesser-known and more recent example: the re-identification of the Myki data set by Culnane, Rubinstein, and Teague [CRT19].
Myki is a smart-card-based public transit ticketing system used in the state of Victoria in Australia, not unlike the Presto card used in the Greater Toronto Area: travellers tap on when boarding a bus or train, tap off when exiting, and the system registers their trip and charges them accordingly. As part of a datathon in 2018, the Victorian government released a large data set of Myki usage information, which contestants could use to analyze the state’s public transit system. The data set contained information about trips: when and where a card was tapped on, when and where it was tapped off, and which bus or train line was used. By way of de-identification, card ID numbers were replaced with random strings, and, of course, no names or addresses of registered users were included. However, if the same card was used for several trips, then the same random string was listed for each trip.
The first step the researchers took in breaking the anonymity of this data set feels surprisingly innocuous: they found their own cards in the data set. This is simple: the attacker recalls several times and places where they took a bus, for example, the times they took a bus to work on two consecutive work days, and one weekend trip. If there is only one card in the data set that was used for all of these trips, then it must be the attacker’s card. This works most of the time because, as the researchers verified, only two trips are usually enough to identify a card uniquely.
While identifying one’s own card scarcely feels like a privacy violation, it enables a much more damaging second step: identifying the card of someone the attacker knows. For example, once the attacker knows which card in the data set is theirs, they can easily identify the card of a coworker. Perhaps the coworker takes the bus home at the same time as the attacker, and maybe they attended some work events together. The attacker can check which cards were used together with their own card at these times, and, again, it turns out that, more often than not, there is a unique match in the data set. Having identified the card of a coworker, the attacker can find out every other trip the coworker has taken: weekend trips, doctor visits, or other travel information that they probably expect to be private.
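To make the mechanics of this attack concrete, the following sketch shows how an attacker could match a handful of remembered trips against a de-identified trip log. The records, card identifiers, and stop names are hypothetical, and the code is only a minimal illustration of the matching step, not the researchers’ actual analysis.

```python
from collections import defaultdict

# Hypothetical de-identified trip records: (pseudonymous card ID, day, hour, stop).
trips = [
    ("c9f2", "Mon", 8, "Flinders St"),
    ("c9f2", "Tue", 8, "Flinders St"),
    ("c9f2", "Sat", 14, "St Kilda"),
    ("7a41", "Mon", 8, "Flinders St"),
    ("7a41", "Wed", 18, "Southern Cross"),
    # ... millions more rows in the real data set
]

# Trips the attacker remembers taking (or remembers a coworker taking with them).
known_trips = {("Mon", 8, "Flinders St"), ("Sat", 14, "St Kilda")}

# Index the data set by card.
trips_by_card = defaultdict(set)
for card, day, hour, stop in trips:
    trips_by_card[card].add((day, hour, stop))

# Any card whose history contains all of the known trips is a candidate match;
# in the Myki data set, two or three trips typically left a unique candidate.
candidates = [card for card, seen in trips_by_card.items() if known_trips <= seen]
print("matching cards:", candidates)  # -> ['c9f2']
```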
The researchers also showed that such privacy breaches can be extended further by linking the data set with publicly available information. They cross-referenced the tweets of Anthony Carbines, a member of the Victorian Legislative Assembly, with the data set, and, since Carbines often proudly tweets about the Victorian public transit system when boarding a train, they could identify his card as well.
Let us draw several lessons from this attack:
- Anything is potentially personally identifying information, because combining a few innocuous pieces of information often identifies a person uniquely.
- Who is doing the identification matters. It is likely that whoever is curious about your private information already knows a lot about you. This knowledge can be leveraged to find out even more.
- De-identification is hard in a connected world. Twitter, LinkedIn, and other social networks and websites provide easily accessible information that an attacker can leverage to facilitate re-identifying people in a supposedly de-identified data set.
In many practical scenarios, an adversary will not have direct access to the data but instead to a by-product of the data. This could be, for instance, the output of a database query, or the prediction of a machine learning model trained on the data. One may hope that, because such by-products only contain aggregate statistical information, they are inherently privacy-preserving. This hope, unfortunately, does not bear out in reality, either.
As a first example, let us take reconstruction attacks from counts. Researchers and official statistics agencies often publish data summaries in the form of tables that contain counts of how many people satisfy some property: these can be summaries of surveys, or voting polls, or census data published by Statistics Canada or the US Census Bureau. Such counts feel safer, from a privacy perspective, than fine-grained data sets such as the Myki data set. They can, however, still enable an attacker to identify individuals. Imagine, for example, that someone poses two counting queries to the data set: “How many computer science professors at the University of Toronto smoke?” and “How many computer science professors at the University of Toronto who were not born in Bulgaria smoke?” Each count is (probably) a relatively large number, so neither would seem like a threat in isolation, but subtracting the second number from the first immediately reveals whether one of the authors of this submission smokes.
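The differencing attack requires no sophistication at all, as the following toy example illustrates; the records and attribute names are, of course, made up.

```python
# Hypothetical records: (name, born_in_bulgaria, smokes).
records = [
    ("Prof. A", True,  False),
    ("Prof. B", False, True),
    ("Prof. C", False, True),
    # ... the rest of the department
]

# The two counting queries, each of which looks harmless on its own.
count_all = sum(1 for _, _, smokes in records if smokes)
count_not_bulgaria = sum(1 for _, born_bg, smokes in records if smokes and not born_bg)

# Their difference reveals whether the single professor born in Bulgaria smokes.
print("target smokes?", count_all - count_not_bulgaria == 1)
```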
A common defence against this type of differencing attack is to introduce some form of noise to the counts. This could be random noise generated from some probability distribution, or it could be the change in count values introduced by modifying the data. Pioneering work by Dinur and Nissim [DN03] has shown that, unless the amount of noise grows significantly with the number of counts known by the attacker, the private information of most individuals in the data set can still be reconstructed, even from the noisy counts. Dinur and Nissim’s result is very general, and applies whenever the attacker can get accurate enough answers to many uncorrelated counting queries, i.e., counting questions that ask about properties that do not overlap much. Moreover, far from being merely a theoretical threat, this attack was recently carried out against Diffix, a system specifically designed to answer counting queries while protecting privacy [CN20] (the attack is also discussed in a recent blog post). We note that Diffix goes far beyond merely removing personal identifiers: the system applies data modification methods, limits which queries can be made to the data, including disallowing queries using unique data attributes, and adds dependent noise to the answers it provides. Nevertheless, it does not escape the power of reconstruction attacks. Furthermore, the attacks are cheap and easy to execute: Cohen and Nissim note that they could carry out their attack in less than four seconds on a personal laptop.
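To give a sense of how such reconstruction attacks work, the following sketch mounts a Dinur–Nissim-style attack on simulated data: a hidden column of sensitive bits is recovered from noisy answers to random counting queries by solving a linear program (here with off-the-shelf tools from NumPy and SciPy). The data, the noise level, and the query set are all made up for illustration; real attacks, such as the one against Diffix, are more carefully engineered but follow the same principle.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, noise_bound = 32, 400, 2     # n individuals, m queries, magnitude of the noise

secret = rng.integers(0, 2, n)                     # hidden sensitive bit per individual
queries = rng.integers(0, 2, (m, n))               # each query counts a random subset
answers = queries @ secret + rng.integers(-noise_bound, noise_bound + 1, m)  # noisy counts

# Find any fractional data set consistent with all noisy answers:
# |queries @ y - answers| <= noise_bound, with 0 <= y <= 1.
A_ub = np.vstack([queries, -queries])
b_ub = np.concatenate([answers + noise_bound, -(answers - noise_bound)])
result = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n, method="highs")

# Rounding the linear-program solution recovers almost all of the hidden bits.
guess = (result.x > 0.5).astype(int)
print("fraction of sensitive bits recovered:", (guess == secret).mean())
```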
The US Census Bureau has also conducted its own reconstruction experiments and concluded that reconstructing sensitive microdata from the public 2010 Decennial Census release is possible [GAM18]. In a presentation, the Census Bureau reported that it could exactly reconstruct the census block, sex, age, race, and ethnicity of 46% of the population, a number that increases to 71% if age only needs to be accurate to within one year. In response to these experiments, the US Census Bureau has acknowledged that “advances in computing power and the availability of external data sources make database reconstruction and re-identification increasingly likely”, and has recognized that “its traditional disclosure avoidance methods are increasingly insufficient to counter these risks.” For this reason, for the 2020 Decennial Census, the Census Bureau has adopted a formal method of privacy protection, namely differential privacy. We return to differential privacy later in this submission.
It is important to note here that, while reconstructing a data set containing personal information is not the same as re-identifying individuals, reconstruction makes re-identification significantly more likely. Once the data are reconstructed, they become vulnerable to the type of re-identification attack explained above using the example of the Myki data set. This is why the reconstruction experiments conducted by the US Census Bureau prompted the modernization of its disclosure avoidance methodology.
Next, we take the example of a machine learning (ML) application. Typically, a machine learning model is obtained by analyzing a set of data known as the training set. This set contains examples of inputs to the model along with the target “label” (some kind of meaningful information about the inputs) we expect the model to predict. For instance, the inputs could be information extracted from a medical record and the target label would indicate whether or not the corresponding patient would eventually be readmitted to the hospital after being discharged. Once the model is trained, we can then use it to make predictions on previously unseen test data for which we do not know the labels. That is, the model can predict whether or not the medical record of a new patient is indicative of them being at risk of later readmission to the hospital if they are discharged. We expect that if the ML model is presented with unseen test data that is close enough to the training data (in technical terms, from the same distribution), the model will generalize to this new test data and make correct predictions.
In practice, however, models are almost always imperfect and tend to perform better on their training data than on their test data. This means that, for any particular piece of data, the model is more likely to make the right prediction if that data was already analyzed during training than if the model is seeing it for the first time.
This leaks private information, because observing the model’s predictions gives an adversary an advantage in determining whether a piece of data was a member of the model’s training set. Imagine that the model comes from our running example of a medical application, so that it is trained on the records of people who were all treated at a hospital. Then inferring that a record is a member of this training set reveals private information about the corresponding patient; namely, that they may suffer from a medical condition.
Let us briefly explain how these attacks work. The adversary guesses that a piece of data was included in the training data if the model makes a correct prediction on it and, conversely, that it was not included if the model makes an incorrect prediction. Guesses made in this manner are more likely to be right about a piece of data’s membership in the training set than guesses made by randomly flipping a coin, because, as we have observed, the model is more likely to make correct predictions on its training data than on test data. Our recent research demonstrates that performing this attack only requires that the adversary be able to observe the label predicted by the model [CTC20], but the attack can be performed with fewer interactions with the model if the adversary can also observe the confidence of the model as it predicts [SSS17].
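The following sketch illustrates the simplest, correctness-based version of this attack on a model trained on synthetic data; the data set, the model, and the scikit-learn tooling are our own choices for illustration and are not those used in the cited papers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for, e.g., medical records with a binary label.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Train a model that, like most models, performs better on its training data.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# The attack: guess "member of the training set" whenever the predicted label is correct.
flagged_members = model.predict(X_train) == y_train      # guesses on true members
flagged_non_members = model.predict(X_test) == y_test    # guesses on non-members

# Balanced attack accuracy; 0.5 would be no better than flipping a coin.
attack_accuracy = 0.5 * (flagged_members.mean() + (1 - flagged_non_members.mean()))
print(f"membership inference accuracy: {attack_accuracy:.2f}")
```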
How do we move forward?
What these attacks point to is that making an analysis technique (or algorithm) private provides more meaningful guarantees than trying to make a particular set of data private (e.g., through de-identification by removing personally identifying information); this is the direction the scientific community has taken. Seeing a data set in isolation makes it hard, if not impossible, to decide whether the data are successfully de-identified. As we observed already, this depends on what an attacker trying to re-identify individuals knows, and what additional information may be available from other sources. Without knowing who may try to do the re-identification, what information they possess, or which individuals or what new information they are interested in, we cannot decide whether a data set is safe for publication. If, however, we know the method (i.e., the algorithm) through which the data was analyzed to produce some by-product, such as a table of counts or a machine learning model, then we can make guarantees that hold against any possible attacker, with any kind of side information. This insight was developed initially in work by Dwork, McSherry, Nissim, and Smith [DMNS06], who introduced a seminal framework for private data analysis known as differential privacy. Differential privacy has seen increasing adoption in recent years, both in industry (by Google, Apple, and Facebook, among others) and by official statistics agencies, most notably the US Census Bureau starting with the 2020 Decennial Census.
As we mentioned, differential privacy defines privacy with respect to an algorithm used to analyze data, rather than with respect to the data themselves. As a result, it enables one to deploy privacy-preserving analysis on data that has not been de-identified and release the outputs of this analysis while providing strong guarantees of privacy.
Informally, the definition can be described using a thought experiment. Imagine that we execute the algorithm on a data set, and we also execute it on the same data set with the data of one individual modified. To be differentially private, the algorithm must behave almost identically in these two scenarios, and this must hold for all data sets and all modifications to the data of any one individual. For example, a differentially private algorithm must have the property that if it produces a particular machine learning model from a data set that includes your data, then it is almost as likely to produce the same model if your data were removed or changed completely. This means that an adversary observing or using the model output by this algorithm is essentially unable to tell whether your data (or any other person’s data) were used to train the model at all. Then, whatever information an attacker may learn about you from observing the algorithm’s output could have also been learned even if your data were never seen by the algorithm.
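The sketch below runs a small version of this thought experiment on a single count released with Laplace noise, one of the standard differentially private mechanisms. The counts and the privacy parameter are arbitrary choices for illustration.

```python
import numpy as np

epsilon = 1.0                    # privacy parameter: smaller means stronger protection
rng = np.random.default_rng(0)

def dp_count(true_count):
    """Release a count with Laplace noise of scale 1/epsilon; adding or removing
    one person's record changes the true count by at most 1 (its sensitivity)."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

def likelihood(observed, true_count):
    """Density of the released value if the underlying true count were true_count."""
    return (epsilon / 2) * np.exp(-epsilon * abs(observed - true_count))

# The thought experiment: the same algorithm run on a data set that includes your
# record (true count 120) and on one that does not (true count 119).
released = dp_count(120)
ratio = likelihood(released, 120) / likelihood(released, 119)
print(f"released count: {released:.1f}")
print(f"likelihood ratio between the two scenarios: {ratio:.2f} "
      f"(always between e^-epsilon and e^epsilon = {np.exp(epsilon):.2f})")
```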
Differentially private algorithms have the property that they reveal statistical information about the data, such as correlations between genetic factors and medical conditions, but do not reveal information about particular individuals, such as whether they have some medical condition. This is because statistical information does not depend crucially on any one person’s data, while private personal information does. We should note that the actual technical definition is quantitative, and the level of privacy protection is tunable, allowing us to trade off privacy against the accuracy of the machine learning model or of the query answers produced by the algorithm.
Let us illustrate this with a short example of how differential privacy can be achieved. Suppose we just want to release two counts, such as our earlier examples “How many computer science professors at the University of Toronto smoke?” and “How many computer science professors at the University of Toronto who were not born in Bulgaria smoke?” Then we can add some random noise to each count, for example by flipping some number of coins and adding the number of tails we get to the first count, and doing the same with new coins for the second count. Now, even if an attacker subtracts the two noisy counts from each other, the result they get will be dominated by the random noise, and they cannot tell whether one of the authors of this submission smokes. The more coins we flip, the better the privacy protection. At the same time, we can achieve good privacy protection with many fewer coin flips than the number of computer science professors at the University of Toronto, giving us reasonably accurate answers to the two counting questions.
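A minimal sketch of this coin-flipping defence follows; the counts are hypothetical, and for clarity each noisy count is released as-is (in practice one might also subtract the expected number of tails so that the published counts remain roughly unbiased).

```python
import numpy as np

rng = np.random.default_rng()
coins = 100                      # number of coins flipped per released count

true_count_all = 120             # hypothetical answer to the first query
true_count_not_bulgaria = 119    # hypothetical answer to the second query

# Add the number of tails out of `coins` fair coin flips to each count independently.
noisy_all = true_count_all + rng.binomial(coins, 0.5)
noisy_not_bulgaria = true_count_not_bulgaria + rng.binomial(coins, 0.5)

# The differencing attack now fails: the true difference (0 or 1) is swamped by
# noise whose typical magnitude is on the order of the square root of `coins`.
print("noisy difference:", noisy_all - noisy_not_bulgaria)
```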
Conclusion.
Given what the scientific community has learned about the risks of re-identification attacks, and about defining privacy protection rigorously, we advocate that de-identification be interpreted in terms of how data are analyzed. Combined with proper access control measures to ensure that the data themselves are not directly accessible, analyzing data with algorithms that provide rigorous guarantees of privacy allows us to envision a future where respectful, yet nevertheless useful, applications of data analysis are possible. Such a modern perspective on protecting personal information should be a crucial component of a reformed Privacy Act.
[CN20] Aloni Cohen, Kobbi Nissim. 2020. Linear Program Reconstruction in Practice. Journal of Privacy and Confidentiality 10 (1).
[CTC20] Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot. 2020. Label-Only Membership Inference Attacks.
[CRT19] Chris Culnane, Benjamin Rubinstein, Vanessa Teague. 2019. Stop the Open Data Bus, We Want to Get Off.
[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. TCC 2006.
[DN03] Irit Dinur, Kobbi Nissim. 2003. Revealing Information While Preserving Privacy. PODS 2003.
[GAM18] Simson Garfinkel, John Abowd, Christian Martindale. 2018. Understanding Database Reconstruction Attacks on Public Data. ACM Queue 16 (5): 28–53.
[MN12] Aleksandar Nikolov, S. Muthukrishnan. 2012. Optimal Private Halfspace Counting via Discrepancy. STOC 2012.
[S00] Latanya Sweeney. 2000. Simple Demographics Often Identify People Uniquely. Health 671: 1–34.
[SSS17] Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov. 2017. Membership Inference Attacks Against Machine Learning Models. IEEE Symposium on Security and Privacy 2017.