Data Protection for Pew Research Center’s Asian American Survey


Problem

The Pew Research Center wanted to release a public use file (PUF) for the Asian American Survey, which required that the file be anonymized.

Public use files (PUFs) are data sets made publicly available so that external researchers can access the data. A common concern with such files is respondent re-identification: whether the released information could be used to identify respondents. These files often contain sensitive information that could harm respondents if their identities became known. The Pew Research Center (the Center) needed an expert on the topic to review the Asian American Survey and recommend privacy-preserving methodology, with the goal of creating an anonymized PUF that minimized or eliminated the risk of respondent re-identification while keeping analytical results reasonably close to those from the actual data. Each prior release had undergone disclosure limitation. However, the solution for the latest release also needed to account for the Center’s previously released summarized results, such as summary statistics and tabulations.

Solution

NORC used a combination of sampling, recoding, post-randomization (PRAM), and partial synthesis to protect respondents on the PUF.

NORC used sampling as its primary form of disclosure protection for the PUF. We chose this method because counts from the subsampled file inherently differ from those in previously released estimates, alleviating the difficulties posed by the summarized results. Recoding, partial synthesis, and PRAM were used to further protect demographic and sensitive variables on the PUF.
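To make the PRAM step concrete, here is a minimal sketch of post-randomization on a single categorical variable. It is an illustration only, not NORC's implementation: the function name `pram`, the three categories, and the transition probabilities are all hypothetical. Each value is kept or swapped for another category according to a row of transition probabilities.

```python
import random

def pram(values, transition, seed=0):
    """Post-randomization sketch: redraw each value from the row of
    transition probabilities that belongs to its current category."""
    rng = random.Random(seed)
    perturbed = []
    for v in values:
        categories = list(transition[v].keys())
        probabilities = list(transition[v].values())
        perturbed.append(rng.choices(categories, weights=probabilities, k=1)[0])
    return perturbed

# Hypothetical transition matrix (each row sums to 1): a value keeps its
# category with probability 0.9 and moves to each other one with 0.05.
transition = {
    "A": {"A": 0.90, "B": 0.05, "C": 0.05},
    "B": {"A": 0.05, "B": 0.90, "C": 0.05},
    "C": {"A": 0.05, "B": 0.05, "C": 0.90},
}

# Hypothetical data: 1,000 values of a three-category variable.
original = ["A"] * 500 + ["B"] * 300 + ["C"] * 200
perturbed = pram(original, transition, seed=42)

# Number of records whose published value differs from the true value.
changed = sum(o != p for o, p in zip(original, perturbed))
print(f"{changed} of {len(original)} values were changed")
```

Because an outside user cannot tell which values were changed, a match on the perturbed variable no longer confirms a respondent's true category. In practice the transition probabilities would be chosen to balance that protection against distortion of the published distributions.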

To implement this approach, NORC drew 70 subsamples from the original responding sample of approximately 7,000 respondents. The subsamples were drawn both with and without replacement, used either simple random sampling or stratified random sampling, and came in four sizes: 1,000, 2,000, 3,500, and 7,000 (the 7,000 size was used only when sampling with replacement). Each dataset was then tested for both disclosure risk and data utility. For disclosure risk, NORC used the probability of attribution for statistically unique records, checking that no series of responses from an individual could be used to identify that individual. For data utility, propensity scores were used to confirm that key distributions were comparable between the original dataset and the PUF. Once a sample was selected for release, the data were re-weighted so that estimates from the subsample would be sufficiently similar to those from the original sample.
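The subsampling and re-weighting steps can be sketched roughly as follows. This is a simplified illustration on synthetic data, not NORC's procedure: the field names (`region`, `age_group`, `weight`), the proportional allocation, and the ratio adjustment of weights are assumptions made for the example. The count of sample-unique records is only a crude stand-in for the probability-of-attribution measure described above.

```python
import random
from collections import Counter, defaultdict

def srs_subsample(records, n, rng, replace=False):
    """Simple random sample of n records, with or without replacement."""
    if replace:
        return [dict(rng.choice(records)) for _ in range(n)]
    return [dict(r) for r in rng.sample(records, n)]

def stratified_subsample(records, n, stratum_key, rng):
    """Stratified sample without replacement, proportional allocation.
    Rounding means the realized size can differ slightly from n."""
    strata = defaultdict(list)
    for r in records:
        strata[r[stratum_key]].append(r)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(records))
        sample.extend(dict(r) for r in rng.sample(members, k))
    return sample

def reweight(sample, target_total):
    """Ratio-adjust weights so they sum to the full file's weighted total."""
    factor = target_total / sum(r["weight"] for r in sample)
    for r in sample:
        r["weight"] *= factor
    return sample

def count_sample_uniques(sample, keys):
    """Count records whose combination of key variables occurs only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in sample)
    return sum(1 for c in combos.values() if c == 1)

# Synthetic stand-in for the ~7,000-record responding sample.
rng = random.Random(2024)
full = [
    {
        "region": rng.choice(["NE", "MW", "S", "W"]),
        "age_group": rng.choice(["18-29", "30-49", "50-64", "65+"]),
        "weight": rng.uniform(0.5, 2.0),
    }
    for _ in range(7000)
]
full_total = sum(r["weight"] for r in full)

# One candidate: stratified, without replacement, target size 2,000.
sub = reweight(stratified_subsample(full, 2000, "region", rng), full_total)
n_unique = count_sample_uniques(sub, ["region", "age_group"])

# Another candidate: simple random sampling with replacement, size 7,000.
wr = reweight(srs_subsample(full, 7000, rng, replace=True), full_total)

print(len(sub), n_unique, round(sum(r["weight"] for r in sub), 2))
```

In a workflow like the one described, each candidate subsample would be scored on risk and utility in this way before one is chosen for release. A production re-weighting step would more likely calibrate to several margins rather than a single total.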

Result

NORC’s efforts allowed the Center to release an Asian American Survey PUF with minimal re-identification risk.

Releasing a sample of records rather than the full file strengthens data protection and makes strong privacy protection possible for large files while maintaining good data utility. By including sampling as a key component of the disclosure limitation plan, NORC greatly reduced the re-identification risk for survey participants. The use of subsampling, along with the additional disclosure limitation techniques discussed above, enabled the Center to release a PUF for the Asian American Survey that minimized risk to respondents while maintaining utility close to that of the actual data set.