- Patrick Graham, Christchurch School of Medicine and Health Sciences
- Jim Young, Christchurch School of Medicine and Health Sciences
- Richard Penny, Fidelio Consultancy
- Tony Blakely, Wellington School of Medicine and Health Sciences
- June Atkinson, Wellington School of Medicine and Health Sciences
Supported by the University of Otago
- To develop and apply a hierarchical Bayesian approach to smoothing and confidentialising multi-way cross-classified tabular output from sample survey and census datasets.
- To investigate the application of hierarchical Bayesian modelling to the problem of confidentialising unit record data.
- To scope strategies for hierarchical Bayesian modelling with complex sample survey data (i.e. for designs other than simple random sampling).
Researchers in public health and social science are making increasing use of data held by government agencies such as Statistics New Zealand (SNZ). As researchers seek access to increasingly detailed data from data suppliers, confidentiality and privacy issues are becoming more prominent. Confidentiality concerns currently limit access to census and survey datasets. For example, census output is randomly rounded; unit-record data (data files containing individual-level data) held by SNZ can be accessed only on SNZ premises, via the Datalab; and there are limitations on the level of detail which can be extracted from the Datalab. In addition, even large datasets can yield apparently anomalous results when highly stratified, particularly for smaller strata (e.g. for tertiary-qualified Pacific Island males, crude mortality rates may appear lower for ages 60-64 than for ages 55-59). This suggests that some smoothing of raw cell counts will often aid interpretation of highly cross-classified tabular data. The primary insight of this project is that smoothing is also a confidentialising tool. This project will develop new statistical methodology for smoothing and confidentialising both cross-classified tabular data and unit-record census and survey data files (for example, the Household Disability Survey). The ultimate aim of this research is to improve research access to the important datasets held by government agencies and to improve the quality of the data released to researchers.
Our methodological development will be within the framework of hierarchical Bayesian (HB) modelling, which has emerged over the last thirty years as a flexible approach to statistical modelling in complex multi-parameter settings. Areas of application include spatial modelling, institutional performance, longitudinal studies, and ranking and selection problems.
A characteristic feature of HB models is that final parameter estimates are a compromise between estimates based solely on the data and estimates which would result from full prior commitment to a specified model. Thus, the HB approach permits some smoothing without imposing modelling assumptions on the data in an absolute sense.
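The compromise described above can be illustrated with a minimal sketch. It is not the project's model: it assumes a simple conjugate Gamma-Poisson setup in which each cell's mortality rate has a prior centred on a rate from a smooth model, and in which the prior weight (`prior_strength`, expressed in person-years) is fixed by hand rather than estimated from the data as a full HB model would do. The function name, parameter names, and all numbers are hypothetical and chosen only for illustration.

```python
def smooth_rates(deaths, person_years, model_rates, prior_strength):
    """Posterior mean rate per cell under an assumed Gamma-Poisson model.

    Each smoothed rate is a weighted average of the raw rate (deaths / person-years)
    and the model-based rate, with weight w = n / (n + prior_strength) on the raw rate.
    Cells with little exposure are therefore pulled more strongly towards the model.
    """
    smoothed = []
    for y, n, m in zip(deaths, person_years, model_rates):
        w = n / (n + prior_strength)   # weight on the data-only estimate
        raw = y / n                    # estimate based solely on the data
        smoothed.append(w * raw + (1 - w) * m)
    return smoothed


# Invented illustrative numbers for two adjacent age groups in a small stratum.
deaths = [4, 2]                 # ages 55-59, ages 60-64
person_years = [400.0, 350.0]
model_rates = [0.008, 0.012]    # hypothetical rates from a smooth model, rising with age

raw_rates = [y / n for y, n in zip(deaths, person_years)]
smoothed_rates = smooth_rates(deaths, person_years, model_rates, prior_strength=500.0)
```

In this invented example the raw rate for ages 60-64 is lower than for ages 55-59, while the smoothed rates increase with age; neither the data nor the model is followed exactly.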
SNZ's current approach to confidentialising data released to researchers is based on adding noise, via devices such as random rounding of cell counts in cross-classifications. We are proposing a radically different approach: by developing HB models for smoothing cross-classified tabular data, our methods will protect confidentiality by removing noise. While finely cross-classified data is sufficient for many research purposes, complex analyses are often facilitated by access to individual-level data. Here too, the application of HB ideas holds promise. HB models may provide a vehicle for reproducing the information contained in a given dataset, by using the fitted model to generate a synthetic dataset in which all the key relations of the original dataset are preserved. Because the resulting dataset is synthetic, confidentiality issues should be minimised and research access to unit record data improved.
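The idea of generating a synthetic dataset from a fitted model can be sketched as follows. This is only an illustration under strong assumptions, not the methodology the project will develop: it treats each unit record as a draw from a multinomial over cells, uses a Dirichlet posterior for the cell probabilities in place of a fitted HB model, and takes the prior pseudo-counts as given. The function name, cell labels, and counts are hypothetical.

```python
import random


def synthetic_records(counts, prior_pseudo_counts, n_records, seed=0):
    """Draw synthetic unit records from an assumed Dirichlet-multinomial model.

    counts and prior_pseudo_counts map each cell label to an observed count and
    a strictly positive prior pseudo-count. The cell probabilities are drawn once
    from the Dirichlet posterior, then n_records cells are sampled from them.
    """
    rng = random.Random(seed)
    cells = list(counts)
    # Dirichlet posterior parameters: observed count plus prior pseudo-count.
    posterior = [counts[c] + prior_pseudo_counts[c] for c in cells]
    # A Dirichlet draw is a vector of independent gamma variates, normalised.
    gammas = [rng.gammavariate(a, 1.0) for a in posterior]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Each synthetic record is a cell label drawn with the sampled probabilities.
    return rng.choices(cells, weights=probs, k=n_records)


# Invented counts for a two-way classification (age group by qualification).
counts = {
    ("55-59", "tertiary"): 12,
    ("55-59", "no tertiary"): 140,
    ("60-64", "tertiary"): 3,
    ("60-64", "no tertiary"): 95,
}
prior = {cell: 2.0 for cell in counts}   # hypothetical pseudo-counts
records = synthetic_records(counts, prior, n_records=250, seed=1)
```

No synthetic record corresponds to a particular respondent, but the cross-classification of `records` reflects the smoothed structure of the original table.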
The contribution of this project
The methods developed in this project have the potential to improve both research access to data held by government agencies and the quality of the data released by such agencies. In addition, the development of methods that provide some smoothing of extensively cross-classified tabular data without requiring full prior commitment to smooth models, as is typically required by non-hierarchical models, is of independent interest. The development of such methods would greatly assist exploratory and graphical analysis of survey and census data. At a more theoretical level, the use of HB models to reproduce the information content of a dataset may lead to original work concerning the information content of survey and census data and the representation and reproduction of such information. This may have implications for other data-intensive areas, such as data-warehousing.