FAQ: Impact of the Census Disclosure Avoidance System

Answers to common questions about our recently-released report evaluating the Census’ Disclosure Avoidance System.

Authors
Affiliations

Christopher T. Kenny

Department of Government, Harvard University

Shiro Kuriwaki

Department of Government, Harvard University

Cory McCartan

Department of Statistics, Harvard University

Evan Rosenman

Harvard Data Science Initiative

Tyler Simko

Department of Government, Harvard University

Kosuke Imai

Departments of Government and Statistics, Harvard University

Published

June 2, 2021

We have received a large number of questions and suggestions since releasing the first version of our report on Friday, May 28th. After meeting the Census Bureau’s one month deadline to produce comments on DAS-protected data, we are continuing to undertake additional analyses of the impacts of DAS on the redistricting process and analyses. We are working towards releasing the revised report soon. In the meantime, we provide our current answers to the questions frequently asked by others below.

Are you advocating for the use of swapping methods over differential privacy as a method of privacy protection?

No. We have not studied the impact of swapping on redistricting. It is well known that differential privacy (DP) is theoretically superior to swapping in terms of privacy protection. The key policy question is how much privacy protection we would want at the expense of accurate census measurements. We argue that when changing an important public policy, such as whether additional noise should be added to the census, one needs to carefully determine the impacts of such policy change. We do not argue against DP and believe additional privacy protections to be a worthy consideration for some of the Census variables and products. However, we find that the current implementation of DAS can lead to substantively important changes in redistricting outcomes while also not meaningfully protecting privacy (as demonstrated by the fact that the prediction of individual race remains accurate even with the DAS data). We believe that more studies are needed to determine whether to inject noise and if so how exactly noise should be added. We hope that our study, along with many others, is the first step in answering this important question, by showing the potential consequences of the current DAS on certain redistricting outcomes.

Does the DAS noise “cancel out” at larger geographic scales?

Theoretically yes, but with a caveat. The DAS is designed to have less noise at larger geographic scales, especially for levels of geography in the Census hierarchy, known as “on-spine” geographies (blocks, block groups, tracts, and counties). But geographies like Census places and voting precincts (VTDs) are “off-spine”. The design of the DAS may induce noise that does not cancel out in off-spine geographies.

Our analysis finds that not only is there more noise for VTDs, but also that there remains a particular form of previously undiscussed bias — perhaps an unintentional side-effect of the DAS post-processing procedure needing to satisfy accuracy constraints in on-spine geographies. We find that the DAS data systematically undercounts racially and politically diverse VTDs in comparison to more homogeneous VTDs. How these discrepancies add up into legislative districts clearly depends on the spatial adjacency of diverse and homogenous VTDs. But in some cases the bias does not cancel out. In Pennsylvania, the average Congressional district changes by only 400 or so people. But the majority-Black 3rd Congressional District gains around 2,000 people under the DAS-protected data, while the more diverse 2nd Congressional District, a Black-Hispanic coalition district, loses around 2,000 people.

Will the noisy data make it harder to create partisan gerrymanders?

In general, no. Partisan gerrymanders can be made with election data and voter files, both of which are not part of the Census and so will not contain any DAS-based error. Other residential and political data sources (both public and private) exist as well, which could be used to draw skewed districts. However, for analysts who must use noisy Census data to identify and analyze a potential gerrymander, the noise and the partisan biases discussed above may make it more difficult to do so.

Why is using a simulation-based method useful to assess DAS? What are the downsides?

Without the use of simulation, we are restricted to verifying changes in enacted redistricting plans. Analyzing enacted plans under a particular DAS dataset may not allow us to understand how DAS affects other similar plans. By using simulation methods, we are able to draw new maps with the DAS data under realistic constraints while taking into account the geography and spatial distributions of populations.  A downside of simulation is that the results will depend on the types of redistricting maps the algorithms are designed to generate.  More studies are needed to better understand how DAS affects different types of redistricting maps.

Why do you use strict thresholds to define equal district population parity, thereby labelling more plans invalid?

The use of strict thresholds is informed by a long series of Supreme Court decisions which establish and define the “one person, one vote” principle.  Current practice for redistricting congressional districts is to minimize the population difference across as much as possible.  As summarized by the National Conference of State Legislatures, in many states, this means getting district populations to be within a single person of one another, based on the Census counts.  If only noisy data is available, drawing districts to minimize the population difference becomes at best ambiguous.  Our analysis demonstrates the current statutory and judicial standards may not be applicable to the DAS-protected data.  On this point, our findings about the magnitude of population changes agree with many other analyses of the DAS and the April 28 Demonstration Data.  If the DAS data will be used for redistricting, the key question will be if and how to change the interpretation of “one person, one vote” principle and how such a change will affect redistricting in practice.

Why do you use a strict threshold of 50% to define the majority-minority districts (MMDs)?

We use a 50% threshold for majority minority districts to remain in line with the jurisprudence following Thornburg v. Gingles (1986). Specifically, Bartlett v. Strickland (2009) held that a minority must “constitute a numerical majority of the voting-age population in an area before §2 requires the creation of a legislative district to prevent dilution of that group’s votes.” When simulating districts, the number of districts which satisfy this specific numerical constraint is thus important. There are other ways to think about minority voting in elections, such as “opportunity districts”, “coalitional districts”, “crossover districts”, and “influence districts.” These have less concrete definitions and rely on electoral data, rather than just Census data.  Certainly, strict majorities are not the only relevant measure of minority voting power, and its coarseness makes it sensitive to change introduced by the DAS.  We find that the practice of using strict thresholds for majority minority districts may lead to biased results especially for small districts such as state legislative districts and school boards.

Why does your analysis not adjust for the differential privacy (DP) mechanism, to properly measure uncertainty?

Our analysis examines the consequences of a map drawer taking DAS-protected data as-is. Ideally, analysts can also adjust for the DP mechanism into that data.  However, the Bureau’s deterministic and asymmetric post-processing procedure cannot be easily incorporated into standard redistricting analyses, preventing analysts from quantifying the added uncertainty exactly.  We recommend that the Bureau release the DP data without post-processing, allowing analysts to adjust redistricting plans with this added uncertainty. 

Do your findings about the prediction of individual race contradict the property of differential privacy?

Technically, no. Differential privacy guarantees that the inference based on a database does not depend on whether or not a particular individual is included in the data.  However, one might argue that for the PL94-171 data, the only sensitive information to be protected is race.  Our prediction methodology combines the census block level racial composition with the publicly available addresses and names of registered voters.  We found that our overall prediction performance does not degrade even if we use the noise-added census block data. And, yet, in our reanalysis of a recent court case where this prediction methodology was prominently used, the prediction appears to be substantively different.  We are in the process of studying when and how these differences arise.  In general, when deciding what privacy to protect at the cost of inaccurate measurements, it is important to consider the consequences of privacy disclosure.  Our finding suggests that the addition of noise to the census data does not necessarily improve the privacy protection in terms of individuals’ race.

What are the differences between your analysis and MGGG’s analysis, both of which are based on simulation methods?

The MGGG Redistricting Lab has also released a study on using DAS data for redistricting purposes, which we believe is well done. In many ways, our results agree with those of the MGGG team. Like the MGGG study, we also find district-level population errors on the scale of hundreds or thousands of people depending on the size of the district (including Section 5 in our analysis and Figure 4 in the MGGG analysis). The core differences in these results come from interpretation. We find similarly sized errors and consider them to be major differences given that districts in many scenarios are drawn to be as exact as possible, down to individual people in many states. For example, Karcher v. Daggett (1983) found that even minute differences in population parity across congressional districts must be justified, even when smaller than the expected error in the decennial census. The MGGG team interprets their results in light of existing noise in current Census data, and say that “the practice of one-person population deviation across districts was never reasonably justified by the accuracy of Census data nor required by law.”  This implies that if DAS data are to be adopted, courts must decide how to change the interpretation of “one person, one vote” principle.

Our analysis differs from the MGGG paper in several ways as well, and more studies are needed in order to better understand the impact of DAS on redistricting analysis and evaluation. First, our analysis relies on different data. We use the April 28th Demonstration Data, which the Census says will “closely approximate” the final data product, and span several states. The MGGG team uses the publicly available implementation of the 2018 TopDown algorithm with a reconstruction experiment to study Texas, mostly Dallas County, so our studies do not overlap in geographic areas.

Second, our analysis is focused on voting precincts (VTDs), which are the building blocks used in constructing redistricting plans in the vast majority of states.  Election results are also reported at this level or the state equivalent. In contrast, the MGGG analysis built districts out of Census blocks, block groups, and tracts.  These geographies are called “on-spine” by the Census bureau, since they nest within each other. Crucially, the DAS is designed to minimize error for these on-spine geographies, but not for off-spine geographies like VTDs. This may explain why our analysis finds more noise in the VTD population and racial counts than the MGGG’s analysis.

Why do you rely on Census data as the ‘ground-truth’? Doesn’t the Census itself already have enumeration error?

Census data, as acknowledged by the Bureau, has errors as well, including undercounts of some minority groups. Despite these inaccuracies, we rely on Census data for our simulations because governments and courts generally consider Census data as “ground-truth” for the purposes of drawing districts (e.g., Karcher v. Daggett 1983). The purpose of our analysis is to investigate how relying on DAS data, rather than on Census data as has generally been done in the past, may lead to different results in some cases. In an ideal world, districts drawn using DAS data would be substantively identical to those drawn with Census data. However, we find cases where districts differ by hundreds or thousands of people depending on whether one relies on the Census or DAS data. Additionally, we believe that more research is necessary to determine whether or not existing enumeration errors will cancel out the error induced by DAS, particularly for “off-spine” but substantively important geographies like VTDs.

As noted by other researchers, enumeration error is also substantively different from the random error injected as part of the DAS.  At small scales like Census blocks, DAS-injected noise is likely on a larger scale than enumeration error—some Census places have seen their population doubled or halved under the DAS, while such a result is unlikely with enumeration methods. As others have pointed out, this has the effect of adding a large random element on top of the more systematic enumeration errors, which tend to overcount or undercount larger areas or entire minority groups.  DAS does not remove or address these systematic errors; it only adds more error on top of them.