What this guide is about
This is a guide for implementing Statistical Disclosure Controls (SDC) on tabular data using R with the sdcTable package. It will include a range of different circumstances that I have encountered.
This guide does not include an introduction into what SDC is all about. The CSO have produced a very good document on Best Practice for Statistical Disclosure Control of Tabular Data (docx, 365KB) which includes an overview of the subject.
Statistical Disclosure Controls are measures that protect against the publication of information relating to an individual person, company or organisation. Implementing robust SDC provides assurance to data providers that their data won’t be inadvertently made public. As mentioned already, this guide covers cell suppression in tabular outputs, which is generally required where certain cross-tabulations are sparsely populated. ‘Unsafe cells’ are identified for primary suppression. Usually other cells need to be suppressed too, so that the value of an unsafe cell cannot be deduced using the aggregate totals and remaining data, and this is called secondary suppression.
Setup
As well as the sdcTable package for R, we will be using the tidyverse collection of packages throughout.
library(tidyverse)
library(sdcTable)
The guide uses a couple of datasets of fictional companies which can be downloaded from here in csv format.
We will start with a dataset called company_sales_1
which can be imported into R as a csv. This dataset contains sales information for 60 fictional companies located in Connaught. There are five columns, the ID and Name of each company, then the County in which it is located, a Y/N indicator to indicate if the company received some kind of Aid, and the value of Sales for the year.
company_sales_1 <- read_csv("input/company_sales_1.csv")
head(company_sales_1)
## # A tibble: 6 x 5
## ID Name County Aid Sales
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Jones Engineering Galway Y 779
## 2 2 Mike's Bakery Galway Y 825
## 3 3 Falafel Inc Galway Y 342
## 4 4 Galway Windows Galway Y 563
## 5 5 Tuam Tech Ltd Galway Y 328
## 6 6 Joe Farm Supplies Galway Y 184
Simple run-through
Here we will produce aggregate sales data broken down across the two dimensions, County and Aid. We want aggregates for all Counties (“Connaught”) and all companies including those that did and did not receive Aid.
The key steps are as follows:
- Create hierarchies using
hier_create
- Create sdcProblem object using
makeProblem
- Primary suppression using
primarySuppression
- Secondary suppression using
protectTable
- Review outputs using
getInfo
Later sections will show more complex cases which will be presented as variations of this simple example.
The objects returned by the functions makeProblem
, primarySuppression
and protectTable
have a class and structure which is specific to the sdcTable package. Briefly, these objects contain information on the raw data, the dimensions and the suppression process. We will see how to extract information from them as the process goes on.
1. Defining hierachies
We need to define a hierarchy for each of our two dimensions County
and Aid
which we shall call dimCounty
and dimAid
. These are created using the hier_create
function from the sdcTable
package as shown below. The total category is identified as the ‘root’ of the hierarchy. Note that this total name/classification does not exist in the microdata. The underlying classifications are then created using the nodes
argument to the function. There are other ways of creating hierarchies, either as a matrix or by creating them in a csv and importing that, which could be useful if you have a hierarchy with a large number of classifications.
dimCounty <- hier_create(root = "Connaught",
nodes = c("Galway",
"Mayo",
"Roscommon",
"Sligo",
"Leitrim"))
dimAid <- hier_create(root = "All companies",
nodes = c("Y", "N"))
We can view our hierarchies using the function hier_display
.
hier_display(dimCounty)
## Connaught
## +-Galway
## +-Mayo
## +-Roscommon
## +-Sligo
## \-Leitrim
hier_display(dimAid)
## All companies
## +-Y
## \-N
2. Setting up the SDC problem
We use the function makeProblem
to create an object of class sdcProblem
as shown below. We are making use of three arguments:
data
, which is the dataset containing the microdata;
dimList
, which is a list object containing the hierarchies that we have created, each identified using a name which matches the name of the variable as it appears in the microdata; and
numVarInd
, the name of the variable containing the numerical data that we want to aggregate.
sales_sdc_problem <- makeProblem(data = company_sales_1,
dimList = list(County = dimCounty, Aid = dimAid),
numVarInd = "Sales")
The object sales_sdc_problem
has a specific class called ‘sdcProblem’. It contains information about the microdata, the hierarchies, and protection parameters generated by the function makeProblem
. We can see the structure if we click on the object in Rstudio, but there are two useful functions for extracting information from these objects that you should know about. The first is the base function print
, which outputs a summary of the suppression process so far.
print(sales_sdc_problem)
## The object is a sdcProblem instance with 18 cells in 2 dimension(s)!
## Protection: no
##
## The dimensions are:
## - County (2 levels; 6 codes; of these being 1 aggregates)
## - Aid (2 levels; 3 codes; of these being 1 aggregates)
##
## Current suppression pattern:
## - Primary suppressions: 0
## - Secondary suppressions: 0
## - Publishable cells: 18
The second useful function for examining sdcProblem objects is sdcProb2df
, which comes with the sdcTable package. As the name suggests, it converts the sdcProblem object into a dataframe, which is convenient. You can assign the output to a new object if you like, here I am just showing the output. You will see that there is a column called ‘sdcStatus’. Each value is equal to ‘s’ which indicates that for now each cell is considered ‘safe’. We will see later that some cells will be marked with ‘u’ for ‘unsafe’ by the primary suppression. Later there will be cells marked ‘x’ which are deemed unsafe by the secondary suppression.
sdcProb2df(sales_sdc_problem)
## strID freq sdcStatus County Aid County_o Aid_o
## 1: 0000 60 s 00 00 Connaught All companies
## 2: 0001 26 s 00 01 Connaught Y
## 3: 0002 34 s 00 02 Connaught N
## 4: 0100 20 s 01 00 Galway All companies
## 5: 0101 9 s 01 01 Galway Y
## 6: 0102 11 s 01 02 Galway N
## 7: 0200 10 s 02 00 Mayo All companies
## 8: 0201 4 s 02 01 Mayo Y
## 9: 0202 6 s 02 02 Mayo N
## 10: 0300 12 s 03 00 Roscommon All companies
## 11: 0301 5 s 03 01 Roscommon Y
## 12: 0302 7 s 03 02 Roscommon N
## 13: 0400 10 s 04 00 Sligo All companies
## 14: 0401 4 s 04 01 Sligo Y
## 15: 0402 6 s 04 02 Sligo N
## 16: 0500 8 s 05 00 Leitrim All companies
## 17: 0501 4 s 05 01 Leitrim Y
## 18: 0502 4 s 05 02 Leitrim N
There is an additional argument to sdcProb2df
that you might find useful which is addNumVars
. If it is set to TRUE
(it is FALSE
by default) will include the aggregated numerical variable as an extra column in the output.
3. Primary suppression
Primary suppression is carried out using the function primarySuppression
. The first argument is the sdcProblem object created by the function makeProblem
(in our case this is sales_sdc_problem
). The second argument is type
, and this can take one of four values:
"freq"
for frequency rule with additional parameter maxN
for the maximum number of entities in an unsafe cell (default value is 3)
"nk"
for nk-dominance rule with additional parameters n
and k
(defaults of 2 and 85 respectively, meaning that in a safe cell two entities cannot contribute more than 85% of the aggregate value)
"p"
for p-percent rule with additional parameter p
(default value of 80, meaning that in a safe cell, if the two largest contributors were removed then the remaining value has to be greater than 80% of the largest contributor)
"pq"
for the pq-rule with additional parameter pq
which is a vector of length two. (The default value of pq is c(25,50)
, meaning that in a safe cell, if the two largest contributors are removed then 50% of the remaining value should be greater than 25% of the largest contributor.)
For the "nk"
, "p"
and "pq"
options there is one further mandatory argument called numVarName
which is required to define the name of numerical variable.
Here we’ll use the nk-dominance rule with default parameters:
sales_sdc_primary <- primarySuppression(sales_sdc_problem,
type = "nk",
numVarName = "Sales")
The output returned by the function primarySuppression
is also of a class sdcProblem, so we can use the functions print
and sdcProb2df
on sales_sdc_primary
. Let’s look at the output from print
, which reveals that one cell is unsafe and will require primary suppression. If you look at the dataframe returned by sdcProb2df(sales_sdc_primary)
you can see which cell is marked ‘u’ for ‘unsafe’.
print(sales_sdc_primary)
## The object is a sdcProblem instance with 18 cells in 2 dimension(s)!
## Protection: no
##
## The dimensions are:
## - County (2 levels; 6 codes; of these being 1 aggregates)
## - Aid (2 levels; 3 codes; of these being 1 aggregates)
##
## Current suppression pattern:
## - Primary suppressions: 1
## - Secondary suppressions: 0
## - Publishable cells: 17
4. Secondary suppression
Secondary suppression is carried out using the function protectTable
which has two main arguments. The first argument is the sdcProblem object to which primary suppression has already been applied (sales_sdc_primary
in our case). The second main argument is method
which allows us to choose an algorithm for secondary suppression. There are four options, and here I am copying the descriptions from the Help page for this function:
- “OPT”: protect the complete problem at once using a cut and branch algorithm. The optimal algorithm should be used for small problem-instances only.
- “HITAS”: split the overall problem in smaller problems. These problems are protected using a top-down approach.
- “HYPERCUBE”: protect the complete problem by protecting sub-tables with a fast heuristic that is based on finding and suppressing geometric structures (n-dimensional cubes) that are required to protect primary sensitive table cells.
- “SIMPLEHEURISTIC” and “SIMPLEHEURISTIC_OLD”: heuristic procedures which might be applied to large(r) problem instances.
There are some other arguments for parameters to these algorithms that can be varied but I have not used them. You can investigate them yourself on the Help page for protectTable
.
Here we will use the HITAS method. The resulting object has both primary and secondary suppression applied. The next section shows how to view the results.
sales_sdc_secondary <- protectTable(sales_sdc_primary, method = "HITAS")
5. Reviewing outputs
As already described, the functions makeProblem
and primarySuppression
returned objects of type ‘sdcProblem’, and these were analysed using the functions print
and sdcProb2df
. The class of object returned by protectTable
depends on which version of the sdcTable package you are using. In the current version (0.32) protectTable
also returns a sdcProblem object, but in earlier versions (up to 0.31) it returns a slightly different class of object called ‘safeObj’, and this cannot be examined using print
and sdcProb2df
. However, when it comes to the output from the secondary suppression I find that the best tool to analyse the output is the getInfo
function, and this can be used in the same way regardless of which version of the package you are using.
I am going to create the output in a dataframe called sdc_output
using getInfo
applied to the output from the secondary suppression (sales_sdc_secondary
) and an additional argument type = "finalData"
to request the final data from the suppression.
In the dataframe sdc_output
you will see that there are five columns. The first two are our two dimensions, County and Aid. Note that these classifications appear in the order that they were defined when the hierarchies were created. Then there is Freq
for the frequency or number of companies in each cross-tabulation. Next is Sales
, which is the aggregate value of Sales for each cross-tabulation. Finally there is sdcStatus
which shows the suppression status. We have already seen that it can take values of s
for ‘safe’ cells and u
for cells deemed ‘unsafe’ by the primary suppression. Now we see a third value, x
, for cells that require secondary suppression. There is one ‘unsafe’ cell for companies in Leitrim that received Aid, and a related secondary suppression for Leitrim companies that did not receive Aid. Secondary suppression of one other county is also required so that the Leitrim cells cannot be calculated from the total (Connaught) line. In this case the algorithm has chosen Sligo (rows 14 and 15).
sdc_output <- getInfo(sales_sdc_secondary , type = "finalData")
print(sdc_output)
## County Aid Freq Sales sdcStatus
## 1: Connaught All companies 60 28703 s
## 2: Connaught Y 26 13127 s
## 3: Connaught N 34 15576 s
## 4: Galway All companies 20 9656 s
## 5: Galway Y 9 4770 s
## 6: Galway N 11 4886 s
## 7: Mayo All companies 10 4738 s
## 8: Mayo Y 4 2516 s
## 9: Mayo N 6 2222 s
## 10: Roscommon All companies 12 6660 s
## 11: Roscommon Y 5 2408 s
## 12: Roscommon N 7 4252 s
## 13: Sligo All companies 10 4991 s
## 14: Sligo Y 4 2181 x
## 15: Sligo N 6 2810 x
## 16: Leitrim All companies 8 2658 s
## 17: Leitrim Y 4 1252 u
## 18: Leitrim N 4 1406 x
The variable sdcStatus
can also take a value z
, and we will see later that this can come about when we override the secondary suppression to publish certain cross-tabulations.
Multi-level hierarchies
Let’s examine a very similar situation to the first example, except now instead of having 20 companies from ‘Galway’ we have some which are from ‘Galway City’ and others which are from ‘County Galway’. The remaining counties of Mayo, Roscommon, Sligo and Leitrim are the same.
company_sales_2 <- read_csv("input/company_sales_2.csv")
head(company_sales_2, 12)
## # A tibble: 12 x 5
## ID Name County Aid Sales
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Jones Engineering Galway City Y 779
## 2 2 Mike's Bakery Galway City Y 825
## 3 3 Falafel Inc Galway City N 342
## 4 4 Galway Windows Galway City Y 563
## 5 5 Tuam Tech Ltd Galway City N 328
## 6 6 Joe Farm Supplies Galway City N 184
## 7 7 Smith Catering Galway City Y 800
## 8 8 Lavelle Tool Hire Galway City N 453
## 9 9 Hyde Accontants Galway City N 496
## 10 10 Nolan Metals Galway County Y 843
## 11 11 Thomas Meats Galway County Y 48
## 12 12 Duffy Wiring Galway County N 154
In our output tables we want to have a subtotal line for Galway as a whole as well as the total line for Connaught. This is accomplished by creating the hierarchy ‘dimCounty’ as before (we’ll refer to Galway as ‘Galway Total’ this time), but then breaking down the node ‘Galway Total’ into two further nodes using the function hier_add
. The function hier_add
is very similar to hier_create
except that it requires the name of the hierarchy being edited as its first argument. Note that ‘Galway Total’ will appear in the output table but, as with ‘Connaught’, it does not appear in the microdata.
You can add additional breakdowns through repeated use of hier_add
, each time choosing an existing node to use as a root with lower nodes attached.
dimCounty_GalwayBreakdown <- hier_create(root = "Connaught",
nodes = c("Galway Total",
"Mayo",
"Roscommon",
"Sligo",
"Leitrim"))
dimCounty_GalwayBreakdown <- hier_add(dimCounty_GalwayBreakdown,
root = "Galway Total",
nodes = c("Galway City",
"Galway County"))
hier_display(dimCounty_GalwayBreakdown)
## Connaught
## +-Galway Total
## | +-Galway City
## | \-Galway County
## +-Mayo
## +-Roscommon
## +-Sligo
## \-Leitrim
Now let’s run the same suppression as before. We’ll do it in a single step this time by piping the various functions together:
sales_sdc_secondary <- makeProblem(data = company_sales_2,
dimList = list(County = dimCounty_GalwayBreakdown,
Aid = dimAid),
numVarInd = "Sales") %>%
primarySuppression(type = "nk" , numVarName = "Sales") %>%
protectTable(method = "HITAS")
print(getInfo(sales_sdc_secondary , type = "finalData"))
## County Aid Freq Sales sdcStatus
## 1: Connaught All companies 60 28703 s
## 2: Connaught Y 25 13220 s
## 3: Connaught N 35 15483 s
## 4: Galway Total All companies 20 9656 s
## 5: Galway Total Y 8 4863 s
## 6: Galway Total N 12 4793 s
## 7: Galway City All companies 9 4770 s
## 8: Galway City Y 4 2967 x
## 9: Galway City N 5 1803 x
## 10: Galway County All companies 11 4886 s
## 11: Galway County Y 4 1896 u
## 12: Galway County N 7 2990 x
## 13: Mayo All companies 10 4738 s
## 14: Mayo Y 4 2516 s
## 15: Mayo N 6 2222 s
## 16: Roscommon All companies 12 6660 s
## 17: Roscommon Y 5 2408 s
## 18: Roscommon N 7 4252 s
## 19: Sligo All companies 10 4991 s
## 20: Sligo Y 4 2181 x
## 21: Sligo N 6 2810 x
## 22: Leitrim All companies 8 2658 s
## 23: Leitrim Y 4 1252 u
## 24: Leitrim N 4 1406 x
## County Aid Freq Sales sdcStatus
Statistics that are also hierarchical
Let’s examine a situation where we have statistics which we want to publish, but are inter-related. In the file company_data_1
we have ‘Sales’, ‘Input Costs’ and ‘Value Added’, where ‘Value Added’ is equal to ‘Sales’ minus ‘Input Costs’.
company_data_1 <- read_csv("input/company_data_1.csv")
head(company_data_1)
## # A tibble: 6 x 6
## ID Name County Sales `Input Costs` `Value Added`
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Jones Engineering Galway 779 616 163
## 2 2 Mike's Bakery Galway 825 348 477
## 3 3 Falafel Inc Galway 342 100 242
## 4 4 Galway Windows Galway 563 349 214
## 5 5 Tuam Tech Ltd Galway 328 301 27
## 6 6 Joe Farm Supplies Galway 184 170 14
Because of this relationship, an unsafe cell in one statistic will require that secondary suppression be applied to at least one of the other two statistics, otherwise the unsafe cell would be easily calculated using the remaining data. We need to consider ‘Statistic’ as a dimension and create a hierarchy for it. But first we need to pivot the data into a longer format:
company_data_1 <- company_data_1 %>%
pivot_longer(cols = 4:6 , names_to = "Statistic")
head(company_data_1)
## # A tibble: 6 x 5
## ID Name County Statistic value
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Jones Engineering Galway Sales 779
## 2 1 Jones Engineering Galway Input Costs 616
## 3 1 Jones Engineering Galway Value Added 163
## 4 2 Mike's Bakery Galway Sales 825
## 5 2 Mike's Bakery Galway Input Costs 348
## 6 2 Mike's Bakery Galway Value Added 477
Now we can make our hierarchy for Statistic. Note that ‘Sales’ is the root since it is the sum of the other two.
dimStatistic <- hier_create(root = "Sales",
nodes = c("Input Costs",
"Value Added"))
hier_display(dimStatistic)
## Sales
## +-Input Costs
## \-Value Added
Ok, let’s use this and carry out the primary and secondary suppressions. For the County dimension we will use the same hierarchy as in the first example, dimCounty
.
company_data_suppressed <- makeProblem(data = company_data_1 ,
dimList = list(County = dimCounty,
Statistic = dimStatistic),
numVarInd = "value") %>%
primarySuppression(type = "nk" , numVarName = "value") %>%
protectTable(method = "HITAS")
print(getInfo(company_data_suppressed , type = "finalData"))
## County Statistic Freq value sdcStatus
## 1: Connaught Sales 120 28899 s
## 2: Connaught Input Costs 60 19709 s
## 3: Connaught Value Added 60 9190 s
## 4: Galway Sales 40 9656 s
## 5: Galway Input Costs 20 6785 s
## 6: Galway Value Added 20 2871 s
## 7: Mayo Sales 20 4934 s
## 8: Mayo Input Costs 10 3020 x
## 9: Mayo Value Added 10 1914 u
## 10: Roscommon Sales 24 6660 s
## 11: Roscommon Input Costs 12 4501 s
## 12: Roscommon Value Added 12 2159 s
## 13: Sligo Sales 20 4991 s
## 14: Sligo Input Costs 10 3534 s
## 15: Sligo Value Added 10 1457 s
## 16: Leitrim Sales 16 2658 s
## 17: Leitrim Input Costs 8 1869 x
## 18: Leitrim Value Added 8 789 x
Recall that the root of the hierarchy dimCounty
, which is ‘Connaught’, does not appear in the microdata, and that these total values are calculated in the course of the data suppression process. However here we can see that the root of the hierarchy dimStatistic
, ‘Sales’, does appear in the microdata, we have not filtered it out. The sdcTable package is able to handle both circumstances. If we had filtered out the Statistic ‘Sales’ from the dataframe company_data_1
before starting the suppression process then the results would have been the same.
Dimensions without a sum
Sometimes we want to provide a breakdown for a dimension but we don’t want to publish a total figure. This could be the case with time series data, where we need tabular data for each day or month or year, but we don’t plan on publishing some total like ‘All years’. In principle, we could create a hierarchy for a variable like ‘Year’ with nodes representing each individual year and a root called ‘All years’, and then we could just discard the data for ‘All years’ and keep the data for the individual years. The problem with this approach is that it would lead to unnecessary secondary suppressions, because the algorithm would assume that there are more routes for calculating unsafe cells than there actually are. Instead, the data for each year needs to be suppressed separately. This could be done in a loop, as the following example shows.
Let’s look at company_sales_3
which includes sales data for 60 companies across five years from 2010 to 2014.
company_data_3 <- read_csv("input/company_sales_3.csv")
head(company_data_3)
## # A tibble: 6 x 5
## ID Name County Year value
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 Jones Engineering Galway 2010 779
## 2 2 Mike's Bakery Galway 2010 825
## 3 3 Falafel Inc Galway 2010 342
## 4 4 Galway Windows Galway 2010 563
## 5 5 Tuam Tech Ltd Galway 2010 328
## 6 6 Joe Farm Supplies Galway 2010 184
We want to create tabular data with dimensions of County and Year, but with totals for County only. The way I would do this is to use a hierarchy for County (dimCounty
as defined earlier) and loop over Year. As we loop over the years we can store all the suppression data in a list. So we start by making an empty list, which we call suppression_list
. The suppression data that I am storing is the "finalData"
output from the getInfo
function. Before I store it in the list I need to add the Year using the mutate
function because it is lost through the suppression process. At the very end the list is converted into a single dataframe called suppressed_data
using the function bind_rows
.
suppression_list <- list() # make an empty list to store the suppression data
years_vector <- 2010:2014 # create a vector of the years
number_of_years <- length(years_vector) # this is just equal to 5
for (i in 1:number_of_years){ # loop over the years
temp_microdata <- company_data_3 %>%
filter(Year == years_vector[i]) # filter out the data for this year
suppression_list[[i]] <- temp_microdata %>% # carry out an "nk" type suppression
makeProblem(dimList = list(County = dimCounty), numVarInd = "value") %>%
primarySuppression(type = "nk" , numVarName = "value") %>%
protectTable(method = "HITAS") %>%
getInfo(type = "finalData") %>% # extract the final data
mutate(Year = years_vector[i]) # create the variable Year again
}
# convert the list into a single dataframe
suppressed_data <- bind_rows(suppression_list)
head(suppressed_data)
## County Freq value sdcStatus Year
## 1: Connaught 60 24938 s 2010
## 2: Galway 20 9656 s 2010
## 3: Mayo 10 4738 s 2010
## 4: Roscommon 12 2895 u 2010
## 5: Sligo 10 4991 s 2010
## 6: Leitrim 8 2658 x 2010
Note that I am looping over 1-5 rather than 2010-2014 so that I can use the index of the loop as the index of the list that I am creating (suppression_list
).
Protecting certain cross-tabluations from secondary suppression