Statistical Disclosure Control using sdcTable

What this guide is about

This is a guide for implementing Statistical Disclosure Controls (SDC) on tabular data using R with the sdcTable package. It will include a range of different circumstances that I have encountered.

This guide does not include an introduction into what SDC is all about. The CSO have produced a very good document on Best Practice for Statistical Disclosure Control of Tabular Data (docx, 365KB) which includes an overview of the subject.

Statistical Disclosure Controls are measures that protect against the publication of information relating to an individual person, company or organisation. Implementing robust SDC provides assurance to data providers that their data won’t be inadvertently made public. As mentioned already, this guide covers cell suppression in tabular outputs, which is generally required where certain cross-tabulations are sparsely populated. ‘Unsafe cells’ are identified for primary suppression. Usually other cells need to be suppressed too, so that the value of an unsafe cell cannot be deduced using the aggregate totals and remaining data, and this is called secondary suppression.

Setup

As well as the sdcTable package for R, we will be using the tidyverse collection of packages throughout.

library(tidyverse)
library(sdcTable)

The guide uses a couple of datasets of fictional companies which can be downloaded from here in csv format.

We will start with a dataset called company_sales_1 which can be imported into R as a csv. This dataset contains sales information for 60 fictional companies located in Connaught. There are five columns, the ID and Name of each company, then the County in which it is located, a Y/N indicator to indicate if the company received some kind of Aid, and the value of Sales for the year.

company_sales_1 <- read_csv("input/company_sales_1.csv")

head(company_sales_1)

## # A tibble: 6 x 5
##      ID Name              County Aid   Sales
##   <dbl> <chr>             <chr>  <chr> <dbl>
## 1     1 Jones Engineering Galway Y       779
## 2     2 Mike's Bakery     Galway Y       825
## 3     3 Falafel Inc       Galway Y       342
## 4     4 Galway Windows    Galway Y       563
## 5     5 Tuam Tech Ltd     Galway Y       328
## 6     6 Joe Farm Supplies Galway Y       184

Simple run-through

Here we will produce aggregate sales data broken down across the two dimensions, County and Aid. We want aggregates for all Counties (“Connaught”) and all companies including those that did and did not receive Aid.

The key steps are as follows:

Create hierarchies using hier_create
Create sdcProblem object using makeProblem
Primary suppression using primarySuppression
Secondary suppression using protectTable
Review outputs using getInfo

Later sections will show more complex cases which will be presented as variations of this simple example.

The objects returned by the functions makeProblem, primarySuppression and protectTable have a class and structure which is specific to the sdcTable package. Briefly, these objects contain information on the raw data, the dimensions and the suppression process. We will see how to extract information from them as the process goes on.

1. Defining hierachies

We need to define a hierarchy for each of our two dimensions County and Aid which we shall call dimCounty and dimAid. These are created using the hier_create function from the sdcTable package as shown below. The total category is identified as the ‘root’ of the hierarchy. Note that this total name/classification does not exist in the microdata. The underlying classifications are then created using the nodes argument to the function. There are other ways of creating hierarchies, either as a matrix or by creating them in a csv and importing that, which could be useful if you have a hierarchy with a large number of classifications.

dimCounty <- hier_create(root = "Connaught",
                         nodes = c("Galway",
                                   "Mayo",
                                   "Roscommon",
                                   "Sligo",
                                   "Leitrim"))

dimAid <- hier_create(root = "All companies",
                      nodes = c("Y", "N"))

We can view our hierarchies using the function hier_display.

hier_display(dimCounty)

## Connaught
## +-Galway
## +-Mayo
## +-Roscommon
## +-Sligo
## \-Leitrim

hier_display(dimAid)

## All companies
## +-Y
## \-N

2. Setting up the SDC problem

We use the function makeProblem to create an object of class sdcProblem as shown below. We are making use of three arguments:

data, which is the dataset containing the microdata;
dimList, which is a list object containing the hierarchies that we have created, each identified using a name which matches the name of the variable as it appears in the microdata; and
numVarInd, the name of the variable containing the numerical data that we want to aggregate.

sales_sdc_problem <- makeProblem(data = company_sales_1,
                           dimList = list(County = dimCounty, Aid = dimAid),
                           numVarInd = "Sales")

The object sales_sdc_problem has a specific class called ‘sdcProblem’. It contains information about the microdata, the hierarchies, and protection parameters generated by the function makeProblem. We can see the structure if we click on the object in Rstudio, but there are two useful functions for extracting information from these objects that you should know about. The first is the base function print, which outputs a summary of the suppression process so far.

print(sales_sdc_problem)
## The object is a sdcProblem instance with 18 cells in 2 dimension(s)!
## Protection: no
## 
## The dimensions are:
##  - County (2 levels; 6 codes; of these being 1 aggregates)
##  - Aid (2 levels; 3 codes; of these being 1 aggregates)
## 
## Current suppression pattern:
##  - Primary suppressions: 0
##  - Secondary suppressions: 0
##  - Publishable cells: 18

The second useful function for examining sdcProblem objects is sdcProb2df, which comes with the sdcTable package. As the name suggests, it converts the sdcProblem object into a dataframe, which is convenient. You can assign the output to a new object if you like, here I am just showing the output. You will see that there is a column called ‘sdcStatus’. Each value is equal to ‘s’ which indicates that for now each cell is considered ‘safe’. We will see later that some cells will be marked with ‘u’ for ‘unsafe’ by the primary suppression. Later there will be cells marked ‘x’ which are deemed unsafe by the secondary suppression.

sdcProb2df(sales_sdc_problem)

##     strID freq sdcStatus County Aid  County_o         Aid_o
##  1:  0000   60         s     00  00 Connaught All companies
##  2:  0001   26         s     00  01 Connaught             Y
##  3:  0002   34         s     00  02 Connaught             N
##  4:  0100   20         s     01  00    Galway All companies
##  5:  0101    9         s     01  01    Galway             Y
##  6:  0102   11         s     01  02    Galway             N
##  7:  0200   10         s     02  00      Mayo All companies
##  8:  0201    4         s     02  01      Mayo             Y
##  9:  0202    6         s     02  02      Mayo             N
## 10:  0300   12         s     03  00 Roscommon All companies
## 11:  0301    5         s     03  01 Roscommon             Y
## 12:  0302    7         s     03  02 Roscommon             N
## 13:  0400   10         s     04  00     Sligo All companies
## 14:  0401    4         s     04  01     Sligo             Y
## 15:  0402    6         s     04  02     Sligo             N
## 16:  0500    8         s     05  00   Leitrim All companies
## 17:  0501    4         s     05  01   Leitrim             Y
## 18:  0502    4         s     05  02   Leitrim             N

There is an additional argument to sdcProb2df that you might find useful which is addNumVars. If it is set to TRUE (it is FALSE by default) will include the aggregated numerical variable as an extra column in the output.

3. Primary suppression

Primary suppression is carried out using the function primarySuppression. The first argument is the sdcProblem object created by the function makeProblem (in our case this is sales_sdc_problem). The second argument is type, and this can take one of four values:

"freq" for frequency rule with additional parameter maxN for the maximum number of entities in an unsafe cell (default value is 3)
"nk" for nk-dominance rule with additional parameters n and k (defaults of 2 and 85 respectively, meaning that in a safe cell two entities cannot contribute more than 85% of the aggregate value)
"p" for p-percent rule with additional parameter p (default value of 80, meaning that in a safe cell, if the two largest contributors were removed then the remaining value has to be greater than 80% of the largest contributor)
"pq" for the pq-rule with additional parameter pq which is a vector of length two. (The default value of pq is c(25,50), meaning that in a safe cell, if the two largest contributors are removed then 50% of the remaining value should be greater than 25% of the largest contributor.)

For the "nk", "p" and "pq" options there is one further mandatory argument called numVarName which is required to define the name of numerical variable.

Here we’ll use the nk-dominance rule with default parameters:

sales_sdc_primary <- primarySuppression(sales_sdc_problem, 
                                        type = "nk", 
                                        numVarName = "Sales")

The output returned by the function primarySuppression is also of a class sdcProblem, so we can use the functions print and sdcProb2df on sales_sdc_primary. Let’s look at the output from print, which reveals that one cell is unsafe and will require primary suppression. If you look at the dataframe returned by sdcProb2df(sales_sdc_primary) you can see which cell is marked ‘u’ for ‘unsafe’.

print(sales_sdc_primary)
## The object is a sdcProblem instance with 18 cells in 2 dimension(s)!
## Protection: no
## 
## The dimensions are:
##  - County (2 levels; 6 codes; of these being 1 aggregates)
##  - Aid (2 levels; 3 codes; of these being 1 aggregates)
## 
## Current suppression pattern:
##  - Primary suppressions: 1
##  - Secondary suppressions: 0
##  - Publishable cells: 17

4. Secondary suppression

Secondary suppression is carried out using the function protectTable which has two main arguments. The first argument is the sdcProblem object to which primary suppression has already been applied (sales_sdc_primary in our case). The second main argument is method which allows us to choose an algorithm for secondary suppression. There are four options, and here I am copying the descriptions from the Help page for this function:

“OPT”: protect the complete problem at once using a cut and branch algorithm. The optimal algorithm should be used for small problem-instances only.
“HITAS”: split the overall problem in smaller problems. These problems are protected using a top-down approach.
“HYPERCUBE”: protect the complete problem by protecting sub-tables with a fast heuristic that is based on finding and suppressing geometric structures (n-dimensional cubes) that are required to protect primary sensitive table cells.
“SIMPLEHEURISTIC” and “SIMPLEHEURISTIC_OLD”: heuristic procedures which might be applied to large(r) problem instances.

There are some other arguments for parameters to these algorithms that can be varied but I have not used them. You can investigate them yourself on the Help page for protectTable.

Here we will use the HITAS method. The resulting object has both primary and secondary suppression applied. The next section shows how to view the results.

sales_sdc_secondary <- protectTable(sales_sdc_primary, method = "HITAS")

5. Reviewing outputs

As already described, the functions makeProblem and primarySuppression returned objects of type ‘sdcProblem’, and these were analysed using the functions print and sdcProb2df. The class of object returned by protectTable depends on which version of the sdcTable package you are using. In the current version (0.32) protectTable also returns a sdcProblem object, but in earlier versions (up to 0.31) it returns a slightly different class of object called ‘safeObj’, and this cannot be examined using print and sdcProb2df. However, when it comes to the output from the secondary suppression I find that the best tool to analyse the output is the getInfo function, and this can be used in the same way regardless of which version of the package you are using.

I am going to create the output in a dataframe called sdc_output using getInfo applied to the output from the secondary suppression (sales_sdc_secondary) and an additional argument type = "finalData" to request the final data from the suppression.

In the dataframe sdc_output you will see that there are five columns. The first two are our two dimensions, County and Aid. Note that these classifications appear in the order that they were defined when the hierarchies were created. Then there is Freq for the frequency or number of companies in each cross-tabulation. Next is Sales, which is the aggregate value of Sales for each cross-tabulation. Finally there is sdcStatus which shows the suppression status. We have already seen that it can take values of s for ‘safe’ cells and u for cells deemed ‘unsafe’ by the primary suppression. Now we see a third value, x, for cells that require secondary suppression. There is one ‘unsafe’ cell for companies in Leitrim that received Aid, and a related secondary suppression for Leitrim companies that did not receive Aid. Secondary suppression of one other county is also required so that the Leitrim cells cannot be calculated from the total (Connaught) line. In this case the algorithm has chosen Sligo (rows 14 and 15).

sdc_output <- getInfo(sales_sdc_secondary , type = "finalData")

print(sdc_output)

##        County           Aid Freq Sales sdcStatus
##  1: Connaught All companies   60 28703         s
##  2: Connaught             Y   26 13127         s
##  3: Connaught             N   34 15576         s
##  4:    Galway All companies   20  9656         s
##  5:    Galway             Y    9  4770         s
##  6:    Galway             N   11  4886         s
##  7:      Mayo All companies   10  4738         s
##  8:      Mayo             Y    4  2516         s
##  9:      Mayo             N    6  2222         s
## 10: Roscommon All companies   12  6660         s
## 11: Roscommon             Y    5  2408         s
## 12: Roscommon             N    7  4252         s
## 13:     Sligo All companies   10  4991         s
## 14:     Sligo             Y    4  2181         x
## 15:     Sligo             N    6  2810         x
## 16:   Leitrim All companies    8  2658         s
## 17:   Leitrim             Y    4  1252         u
## 18:   Leitrim             N    4  1406         x

The variable sdcStatus can also take a value z, and we will see later that this can come about when we override the secondary suppression to publish certain cross-tabulations.

Multi-level hierarchies

Let’s examine a very similar situation to the first example, except now instead of having 20 companies from ‘Galway’ we have some which are from ‘Galway City’ and others which are from ‘County Galway’. The remaining counties of Mayo, Roscommon, Sligo and Leitrim are the same.

company_sales_2 <- read_csv("input/company_sales_2.csv")

head(company_sales_2, 12)

## # A tibble: 12 x 5
##       ID Name              County        Aid   Sales
##    <dbl> <chr>             <chr>         <chr> <dbl>
##  1     1 Jones Engineering Galway City   Y       779
##  2     2 Mike's Bakery     Galway City   Y       825
##  3     3 Falafel Inc       Galway City   N       342
##  4     4 Galway Windows    Galway City   Y       563
##  5     5 Tuam Tech Ltd     Galway City   N       328
##  6     6 Joe Farm Supplies Galway City   N       184
##  7     7 Smith Catering    Galway City   Y       800
##  8     8 Lavelle Tool Hire Galway City   N       453
##  9     9 Hyde Accontants   Galway City   N       496
## 10    10 Nolan Metals      Galway County Y       843
## 11    11 Thomas Meats      Galway County Y        48
## 12    12 Duffy Wiring      Galway County N       154

In our output tables we want to have a subtotal line for Galway as a whole as well as the total line for Connaught. This is accomplished by creating the hierarchy ‘dimCounty’ as before (we’ll refer to Galway as ‘Galway Total’ this time), but then breaking down the node ‘Galway Total’ into two further nodes using the function hier_add. The function hier_add is very similar to hier_create except that it requires the name of the hierarchy being edited as its first argument. Note that ‘Galway Total’ will appear in the output table but, as with ‘Connaught’, it does not appear in the microdata.

You can add additional breakdowns through repeated use of hier_add, each time choosing an existing node to use as a root with lower nodes attached.

dimCounty_GalwayBreakdown <- hier_create(root = "Connaught",
                         nodes = c("Galway Total",
                                   "Mayo",
                                   "Roscommon",
                                   "Sligo",
                                   "Leitrim"))

dimCounty_GalwayBreakdown <- hier_add(dimCounty_GalwayBreakdown,
                      root = "Galway Total",
                      nodes = c("Galway City", 
                                "Galway County"))

hier_display(dimCounty_GalwayBreakdown)

## Connaught
## +-Galway Total
## | +-Galway City
## | \-Galway County
## +-Mayo
## +-Roscommon
## +-Sligo
## \-Leitrim

Now let’s run the same suppression as before. We’ll do it in a single step this time by piping the various functions together:

sales_sdc_secondary <- makeProblem(data = company_sales_2,
                                   dimList = list(County = dimCounty_GalwayBreakdown, 
                                                  Aid = dimAid),
                                   numVarInd = "Sales") %>% 
  primarySuppression(type = "nk" , numVarName = "Sales") %>% 
  protectTable(method = "HITAS")

print(getInfo(sales_sdc_secondary , type = "finalData"))

##            County           Aid Freq Sales sdcStatus
##  1:     Connaught All companies   60 28703         s
##  2:     Connaught             Y   25 13220         s
##  3:     Connaught             N   35 15483         s
##  4:  Galway Total All companies   20  9656         s
##  5:  Galway Total             Y    8  4863         s
##  6:  Galway Total             N   12  4793         s
##  7:   Galway City All companies    9  4770         s
##  8:   Galway City             Y    4  2967         x
##  9:   Galway City             N    5  1803         x
## 10: Galway County All companies   11  4886         s
## 11: Galway County             Y    4  1896         u
## 12: Galway County             N    7  2990         x
## 13:          Mayo All companies   10  4738         s
## 14:          Mayo             Y    4  2516         s
## 15:          Mayo             N    6  2222         s
## 16:     Roscommon All companies   12  6660         s
## 17:     Roscommon             Y    5  2408         s
## 18:     Roscommon             N    7  4252         s
## 19:         Sligo All companies   10  4991         s
## 20:         Sligo             Y    4  2181         x
## 21:         Sligo             N    6  2810         x
## 22:       Leitrim All companies    8  2658         s
## 23:       Leitrim             Y    4  1252         u
## 24:       Leitrim             N    4  1406         x
##            County           Aid Freq Sales sdcStatus

Statistics that are also hierarchical

Let’s examine a situation where we have statistics which we want to publish, but are inter-related. In the file company_data_1 we have ‘Sales’, ‘Input Costs’ and ‘Value Added’, where ‘Value Added’ is equal to ‘Sales’ minus ‘Input Costs’.

company_data_1 <- read_csv("input/company_data_1.csv")

head(company_data_1)

## # A tibble: 6 x 6
##      ID Name              County Sales `Input Costs` `Value Added`
##   <dbl> <chr>             <chr>  <dbl>         <dbl>         <dbl>
## 1     1 Jones Engineering Galway   779           616           163
## 2     2 Mike's Bakery     Galway   825           348           477
## 3     3 Falafel Inc       Galway   342           100           242
## 4     4 Galway Windows    Galway   563           349           214
## 5     5 Tuam Tech Ltd     Galway   328           301            27
## 6     6 Joe Farm Supplies Galway   184           170            14

Because of this relationship, an unsafe cell in one statistic will require that secondary suppression be applied to at least one of the other two statistics, otherwise the unsafe cell would be easily calculated using the remaining data. We need to consider ‘Statistic’ as a dimension and create a hierarchy for it. But first we need to pivot the data into a longer format:

company_data_1 <- company_data_1 %>% 
  pivot_longer(cols = 4:6 , names_to = "Statistic")

head(company_data_1)

## # A tibble: 6 x 5
##      ID Name              County Statistic   value
##   <dbl> <chr>             <chr>  <chr>       <dbl>
## 1     1 Jones Engineering Galway Sales         779
## 2     1 Jones Engineering Galway Input Costs   616
## 3     1 Jones Engineering Galway Value Added   163
## 4     2 Mike's Bakery     Galway Sales         825
## 5     2 Mike's Bakery     Galway Input Costs   348
## 6     2 Mike's Bakery     Galway Value Added   477

Now we can make our hierarchy for Statistic. Note that ‘Sales’ is the root since it is the sum of the other two.

dimStatistic <- hier_create(root = "Sales",
                         nodes = c("Input Costs",
                                   "Value Added"))

hier_display(dimStatistic)

## Sales
## +-Input Costs
## \-Value Added

Ok, let’s use this and carry out the primary and secondary suppressions. For the County dimension we will use the same hierarchy as in the first example, dimCounty.

company_data_suppressed <- makeProblem(data = company_data_1 ,
                                   dimList = list(County = dimCounty, 
                                                  Statistic = dimStatistic),
                                   numVarInd = "value") %>% 
  primarySuppression(type = "nk" , numVarName = "value") %>% 
  protectTable(method = "HITAS")

print(getInfo(company_data_suppressed , type = "finalData"))

##        County   Statistic Freq value sdcStatus
##  1: Connaught       Sales  120 28899         s
##  2: Connaught Input Costs   60 19709         s
##  3: Connaught Value Added   60  9190         s
##  4:    Galway       Sales   40  9656         s
##  5:    Galway Input Costs   20  6785         s
##  6:    Galway Value Added   20  2871         s
##  7:      Mayo       Sales   20  4934         s
##  8:      Mayo Input Costs   10  3020         x
##  9:      Mayo Value Added   10  1914         u
## 10: Roscommon       Sales   24  6660         s
## 11: Roscommon Input Costs   12  4501         s
## 12: Roscommon Value Added   12  2159         s
## 13:     Sligo       Sales   20  4991         s
## 14:     Sligo Input Costs   10  3534         s
## 15:     Sligo Value Added   10  1457         s
## 16:   Leitrim       Sales   16  2658         s
## 17:   Leitrim Input Costs    8  1869         x
## 18:   Leitrim Value Added    8   789         x

Recall that the root of the hierarchy dimCounty, which is ‘Connaught’, does not appear in the microdata, and that these total values are calculated in the course of the data suppression process. However here we can see that the root of the hierarchy dimStatistic, ‘Sales’, does appear in the microdata, we have not filtered it out. The sdcTable package is able to handle both circumstances. If we had filtered out the Statistic ‘Sales’ from the dataframe company_data_1 before starting the suppression process then the results would have been the same.

Dimensions without a sum

Sometimes we want to provide a breakdown for a dimension but we don’t want to publish a total figure. This could be the case with time series data, where we need tabular data for each day or month or year, but we don’t plan on publishing some total like ‘All years’. In principle, we could create a hierarchy for a variable like ‘Year’ with nodes representing each individual year and a root called ‘All years’, and then we could just discard the data for ‘All years’ and keep the data for the individual years. The problem with this approach is that it would lead to unnecessary secondary suppressions, because the algorithm would assume that there are more routes for calculating unsafe cells than there actually are. Instead, the data for each year needs to be suppressed separately. This could be done in a loop, as the following example shows.

Let’s look at company_sales_3 which includes sales data for 60 companies across five years from 2010 to 2014.

company_data_3 <- read_csv("input/company_sales_3.csv")

head(company_data_3)

## # A tibble: 6 x 5
##      ID Name              County  Year value
##   <dbl> <chr>             <chr>  <dbl> <dbl>
## 1     1 Jones Engineering Galway  2010   779
## 2     2 Mike's Bakery     Galway  2010   825
## 3     3 Falafel Inc       Galway  2010   342
## 4     4 Galway Windows    Galway  2010   563
## 5     5 Tuam Tech Ltd     Galway  2010   328
## 6     6 Joe Farm Supplies Galway  2010   184

We want to create tabular data with dimensions of County and Year, but with totals for County only. The way I would do this is to use a hierarchy for County (dimCounty as defined earlier) and loop over Year. As we loop over the years we can store all the suppression data in a list. So we start by making an empty list, which we call suppression_list. The suppression data that I am storing is the "finalData" output from the getInfo function. Before I store it in the list I need to add the Year using the mutate function because it is lost through the suppression process. At the very end the list is converted into a single dataframe called suppressed_data using the function bind_rows.

suppression_list <- list() # make an empty list to store the suppression data
years_vector <-  2010:2014 # create a vector of the years
number_of_years  <-  length(years_vector) # this is just equal to 5

for (i in 1:number_of_years){ # loop over the years
  temp_microdata <- company_data_3 %>% 
    filter(Year == years_vector[i]) # filter out the data for this year
  
  suppression_list[[i]] <- temp_microdata %>% # carry out an "nk" type suppression
    makeProblem(dimList = list(County = dimCounty), numVarInd = "value") %>% 
    primarySuppression(type = "nk" , numVarName = "value") %>% 
    protectTable(method = "HITAS") %>% 
    getInfo(type = "finalData") %>% # extract the final data
    mutate(Year = years_vector[i]) # create the variable Year again
}

# convert the list into a single dataframe
suppressed_data <- bind_rows(suppression_list) 

head(suppressed_data)

##       County Freq value sdcStatus Year
## 1: Connaught   60 24938         s 2010
## 2:    Galway   20  9656         s 2010
## 3:      Mayo   10  4738         s 2010
## 4: Roscommon   12  2895         u 2010
## 5:     Sligo   10  4991         s 2010
## 6:   Leitrim    8  2658         x 2010

Note that I am looping over 1-5 rather than 2010-2014 so that I can use the index of the loop as the index of the list that I am creating (suppression_list).