Title: | Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control |
---|---|
Description: | A tool for producing synthetic versions of microdata containing confidential information so that they are safe to be released to users for exploratory analysis. The key objective of generating synthetic data is to replace sensitive original values with synthetic ones causing minimal distortion of the statistical information contained in the data set. Variables, which can be categorical or continuous, are synthesised one-by-one using sequential modelling. Replacements are generated by drawing from conditional distributions fitted to the original data using parametric or classification and regression trees models. Data are synthesised via the function syn() which can be largely automated, if default settings are used, or with methods defined by the user. Optional parameters can be used to influence the disclosure risk and the analytical quality of the synthesised data. For a description of the implemented method see Nowok, Raab and Dibben (2016) <doi:10.18637/jss.v074.i11>. |
Authors: | Beata Nowok [aut, cre], Gillian M Raab [aut], Chris Dibben [ctb], Joshua Snoke [ctb], Caspar van Lissa [ctb] |
Maintainer: | Beata Nowok <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.8-0 |
Built: | 2024-11-20 04:20:48 UTC |
Source: | https://github.com/bnowok/synthpop |
Generate synthetic versions of a data set using parametric or CART methods.
Package: | synthpop |
Type: | Package |
Version: | 1.8-0 |
Date: | 2022-08-31 |
License: | GPL-2 | GPL-3 |
Synthetic data are generated from the original (observed) data by the function syn. The package also includes tools to compare synthetic data with the observed data (compare.synds), to fit (generalized) linear models to synthetic data (lm.synds, glm.synds), and to compare the estimates with those for the observed data (compare.fit.synds). More extensive documentation with illustrative examples is provided in the package vignette.
Beata Nowok, Gillian M Raab, and Chris Dibben based on package mice (2.18) by Stef van Buuren and Karin Groothuis-Oudshoorn
Maintainer: Beata Nowok <[email protected]>
Nowok, B., Raab, G.M and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Describes features of variables in a data frame relevant for synthesis.
codebook.syn(data, maxlevs = 3)
data |
a data frame with a data set to be synthesised. |
maxlevs |
the number of factor levels above which separate tables with
all labels are returned as part of |
A list with two components.
tab
- a data frame with the following information about each variable:
name |
variable name |
class |
class of variable |
nmiss |
number of missing values ( |
perctmiss |
percentage of missing values |
ndistinct |
number of distinct values (excluding missing values) |
details |
range for numeric variables, maximum length for character variables, labels for factors with <= maxlevs levels |
labs
- a list of extra tables with labels for each factor with number of levels greater than maxlevs.
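The per-variable summary described above can be approximated in base R. The following is a minimal sketch, not the package's implementation: `codebook_sketch` and the toy data frame are hypothetical names introduced for illustration, and only the `name`, `class`, `nmiss`, `perctmiss` and `ndistinct` columns are reproduced.

```r
# Hypothetical helper (not part of synthpop): builds a small table similar in
# spirit to the `tab` component, using base R only.
codebook_sketch <- function(data) {
  data.frame(
    name      = names(data),
    class     = vapply(data, function(x) class(x)[1], character(1)),
    nmiss     = vapply(data, function(x) sum(is.na(x)), integer(1)),
    perctmiss = vapply(data, function(x) 100 * mean(is.na(x)), numeric(1)),
    # distinct values are counted after dropping missing values
    ndistinct = vapply(data, function(x) length(unique(x[!is.na(x)])), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data, invented for illustration
toy <- data.frame(
  age = c(23, 45, NA, 45),
  sex = factor(c("F", "M", "M", NA))
)
cb <- codebook_sketch(toy)
print(cb)
```

Here each variable has one missing value out of four, so `perctmiss` is 25 for both, and each has two distinct non-missing values.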
codebook.syn(SD2011)
A generic function for comparison of synthesised and observed data. The function invokes particular methods which depend on the class of the first argument.
compare(object, data, ...)
object |
a synthetic data object of class |
data |
an original observed data set. |
... |
additional arguments specific to a method. |
Compare methods facilitate quality assessment of synthetic data by comparing them with the original observed data sets. The data themselves (for class synds) or models fitted to them (for class fit.synds) are compared.
The value returned by compare depends on the class of its argument. See the documentation of the particular methods for details.
compare.synds, compare.fit.synds
The same model that was used for the synthesised data set is fitted to the observed data set. The coefficients with confidence intervals for the observed data are plotted together with their estimates from synthetic data. When more than one synthetic data set has been generated (object$m > 1), combining rules are applied. Analysis-specific utility measures are used to evaluate differences between synthetic and observed data.
## S3 method for class 'fit.synds'
compare(object, data, plot = "Z", print.coef = FALSE,
  return.plot = TRUE, plot.intercept = FALSE, lwd = 1, lty = 1,
  lcol = c("#1A3C5A","#4187BF"), dodge.height = .5, point.size = 2.5,
  population.inference = FALSE, ci.level = 0.95, ...)

## S3 method for class 'compare.fit.synds'
print(x, print.coef = x$print.coef, ...)
object |
an object of type |
data |
an original observed data set. |
plot |
values to be plotted: |
print.coef |
a logical value determining whether tables of estimates for the original and synthetic data should be printed. |
return.plot |
a logical value indicating whether a confidence interval plot should be returned. |
plot.intercept |
a logical value indicating whether estimates for intercept should be plotted. |
lwd |
the line width. |
lty |
the line type. |
lcol |
line colours. |
dodge.height |
size of vertical shifts for confidence intervals to prevent overlapping. |
point.size |
size of plotting symbols used to plot point estimates of coefficients. |
population.inference |
a logical value indicating whether intervals for inference to population quantities, as described by Karr et al. (2006), should be calculated and plotted. This option suppresses the lack-of-fit test and the standardised differences since these are based on differences standardised by the original interval widths. |
ci.level |
Confidence interval coverage as a proportion. |
... |
additional parameters passed to |
x |
an object of class |
This function can be used to evaluate whether the method used for synthesis is appropriate for the fitted model. If this is the case, the estimates from the synthetic data of what would be expected from the original data (xpct(Beta) and xpct(Z)) should not differ from the estimates from the observed data (Beta and Z) by more than would be expected from the standard errors (se(Beta) and se(Z)). For more details see the vignette on inference.
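To make the comparison concrete, the standardized difference reported in coef.diff can be sketched by hand: the difference between the combined synthetic estimate and the observed estimate is divided by the standard error from the observed-data fit, and a p-value is taken from the standard Normal distribution. The numbers below are invented for illustration, and the two-sided form of the p-value is an assumption rather than something stated in this manual.

```r
# Hypothetical coefficient estimates and standard errors (illustrative only)
beta_obs <- c(age = 0.020, sexFEMALE = -0.350)  # estimates from observed data
se_obs   <- c(age = 0.005, sexFEMALE =  0.100)  # their standard errors
beta_syn <- c(age = 0.026, sexFEMALE = -0.300)  # combined synthetic estimates

# Difference standardized by the standard error of the original fit
std_diff <- (beta_syn - beta_obs) / se_obs

# Two-sided p-value from a standard Normal distribution (assumed form)
p_value <- 2 * pnorm(-abs(std_diff))

# Summary over all coefficients, analogous to mean.abs.std.diff
mean_abs_std_diff <- mean(abs(std_diff))

print(data.frame(std_diff, p_value))
print(mean_abs_std_diff)
```

Small absolute standardized differences (relative to what a standard Normal variable would produce) suggest that the synthesis method preserved the relationships this model depends on.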
An object of class compare.fit.synds which is a list with the following components:
call |
the original call to fit the model to the synthesised data set. |
coef.obs |
a data frame including estimates based on the observed
data: coefficients ( |
coef.syn |
a data frame including (combined) estimates based on
the synthesised data: point estimates of observed data coefficients
( |
coef.diff |
a data frame containing standardized differences between the coefficients estimated from the original data and those calculated from the combined synthetic data. The difference is standardized by dividing by the estimated standard error of the fit from the original. The corresponding p-values are calculated from a standard Normal distribution and represent the probability of achieving differences as large as those found if the model used for synthesis is compatible with the model that generated the original data. |
mean.abs.std.diff |
Mean absolute standardized difference (over all coefficients). |
ci.overlap |
a data frame containing the percentage of overlap between
the estimated synthetic confidence intervals and the original sample
confidence intervals for each parameter. When |
mean.ci.overlap |
Mean confidence interval overlap (over all coefficients). |
lack.of.fit |
lack-of-fit measure from all |
lof.pvalue |
p-value for the combined lack-of-fit test of the NULL hypothesis that the method used for synthesis retains all relationships between variables that influence the parameters of the fit. |
ci.plot |
|
print.coef |
a logical value determining whether tables of estimates for the original and synthetic data should be printed. |
m |
the number of synthetic versions of the original (observed) data. |
ncoef |
the number of coefficients in the fitted model (including an intercept). |
incomplete |
whether methods for incomplete synthesis due to Reiter (2003) have been used in calculations. |
population.inference |
whether intervals as described by Karr et al. (2006) have been calculated. |
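The ci.overlap component can be illustrated with the interval overlap measure commonly attributed to Karr et al. (2006): the length of the intersection of the two intervals, averaged as a fraction of each interval's own length. This is a sketch under the assumption that the package uses this definition; `ci_overlap` is a hypothetical helper, and it returns a proportion rather than a percentage.

```r
# Hypothetical helper: average relative overlap of two confidence intervals.
# A negative value indicates that the intervals do not intersect.
ci_overlap <- function(lo_obs, up_obs, lo_syn, up_syn) {
  inter <- pmin(up_obs, up_syn) - pmax(lo_obs, lo_syn)  # intersection length
  0.5 * (inter / (up_obs - lo_obs) + inter / (up_syn - lo_syn))
}

half_overlap <- ci_overlap(0, 2, 1, 3)  # each interval shares half its length
full_overlap <- ci_overlap(0, 2, 0, 2)  # identical intervals
no_overlap   <- ci_overlap(0, 1, 2, 3)  # disjoint intervals

print(c(half = half_overlap, full = full_overlap, none = no_overlap))
```

Multiplying the result by 100 gives a percentage, which is how the component is described above.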
Karr, A., Kohnen, C.N., Oganian, A., Reiter, J.P. and Sanil, A.P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician, 60(3), 224-232.
Nowok, B., Raab, G.M and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Reiter, J.P. (2003) Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, 181-188.
ods <- SD2011[,c("sex","age","edu","smoke")]
s1 <- syn(ods, m = 3)
f1 <- glm.synds(smoke ~ sex + age + edu, data = s1, family = "binomial")
compare(f1, ods)
compare(f1, ods, print.coef = TRUE, plot = "coef")
Compare synthesised data set with the original (observed) data set using percent frequency tables and histograms. When more than one synthetic data set has been generated (object$m > 1), by default pooled synthetic data are used for comparison.
This function can also be used with synthetic data NOT created by syn(), but then an additional parameter cont.na might need to be provided.
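The idea of a comparative percent frequency table can be sketched in base R for a single categorical variable. This is an illustration only, not the package's code: `percent_table` and the toy vectors are hypothetical, and missing values are always shown as their own column.

```r
# Hypothetical helper: side-by-side percentages for observed and synthetic values
percent_table <- function(obs, syn) {
  levs <- union(unique(as.character(obs)), unique(as.character(syn)))
  levs <- sort(levs[!is.na(levs)])
  pct <- function(x) {
    tab <- table(factor(as.character(x), levels = levs), useNA = "always")
    out <- as.vector(100 * tab / sum(tab))
    names(out) <- c(levs, "<NA>")  # last cell of `tab` holds missing values
    out
  }
  rbind(observed = pct(obs), synthetic = pct(syn))
}

# Toy data, invented for illustration
obs <- c("a", "a", "b", "b")
syn <- c("a", "b", "b", "b")
res <- percent_table(obs, syn)
print(res)
```

Each row sums to 100, so differences between the two rows show directly where the synthetic distribution departs from the observed one.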
## S3 method for class 'synds'
compare(object, data, vars = NULL, msel = NULL, stat = "percents",
  breaks = 20, nrow = 2, ncol = 2, rel.size.x = 1,
  utility.stats = c("pMSE", "S_pMSE", "df"), utility.for.plot = "S_pMSE",
  cols = c("#1A3C5A","#4187BF"), plot = TRUE, table = FALSE, ...)

## S3 method for class 'data.frame'
compare(object, data, vars = NULL, cont.na = NULL, msel = NULL,
  stat = "percents", breaks = 20, nrow = 2, ncol = 2, rel.size.x = 1,
  utility.stats = c("pMSE", "S_pMSE", "df"), utility.for.plot = "S_pMSE",
  cols = c("#1A3C5A","#4187BF"), plot = TRUE, table = FALSE, ...)

## S3 method for class 'list'
compare(object, data, vars = NULL, cont.na = NULL, msel = NULL,
  stat = "percents", breaks = 20, nrow = 2, ncol = 2, rel.size.x = 1,
  utility.stats = c("pMSE", "S_pMSE", "df"), utility.for.plot = "S_pMSE",
  cols = c("#1A3C5A","#4187BF"), plot = TRUE, table = FALSE, ...)

## S3 method for class 'compare.synds'
print(x, ...)
object |
an object of class |
data |
an original (observed) data set. |
vars |
variables to be compared. If |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
msel |
index or indices of synthetic data copies for which a comparison
is to be made. If |
stat |
determines whether tables and plots present percentages
|
breaks |
the number of cells for the histogram. |
nrow |
the number of rows for the plotting area. |
ncol |
the number of columns for the plotting area. |
rel.size.x |
a number representing the relative size of x-axis labels. |
utility.stats |
a single string or a vector of strings that determines
which utility measures to print. Must be a selection from:
|
utility.for.plot |
a single string that determines which utility
measure to print in facet labels of the plot. Set to |
cols |
bar colors. |
plot |
a logical value with default set to |
table |
a logical value with default set to |
... |
additional parameters. |
x |
an object of class |
Missing data categories for numeric variables are plotted on the same plot as non-missing values. They are indicated by the miss. suffix.
Numeric variables with fewer than 6 distinct values are changed to factors in order to make plots more readable.
An object of class compare.synds which is a list including a list of comparative frequency tables (tables) and a ggplot object (plots) with bar charts/histograms. If multiple plots are produced, they and their corresponding frequency tables are stored as a list.
Nowok, B., Raab, G.M and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
ods <- SD2011[ , c("sex", "age", "edu", "marital", "ls", "income")]
s1 <- syn(ods, cont.na = list(income = -8))

### synthetic data provided as a 'synds' object
compare(s1, ods, vars = "ls")
compare(s1, ods, vars = "income", stat = "counts", table = TRUE, breaks = 10)

### synthetic data provided as 'data.frame'
compare(s1$syn, ods, vars = "ls")
compare(s1$syn, ods, vars = "income", cont.na = list(income = -8),
  stat = "counts", table = TRUE, breaks = 10)
Fits generalized linear models or simple linear models to the synthesised data set(s) using the glm and lm functions, respectively.
glm.synds(formula, family = "binomial", data, ...)
lm.synds(formula, data, ...)

## S3 method for class 'fit.synds'
print(x, msel = NULL, ...)
formula |
a symbolic description of the model to be estimated.
A typical model has the form |
family |
a description of the error distribution
and link function to be used in the model. See the documentation of
|
data |
an object of class |
... |
|
x |
an object of class |
msel |
index or indices of synthetic data copies for which coefficient
estimates are to be displayed. If |
The summary function (summary.fit.synds) can be used to obtain the combined results of models fitted to each of the m synthetic data sets.
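The combining step can be pictured as fitting the same model to every synthetic data set and averaging the results. The sketch below uses base R only and simulated stand-in data (not output of syn()); the object names `mcoef`, `mvar`, `mcoefavg` and `mvaravg` simply mirror the documented components, and the adjustments that summary.fit.synds applies to standard errors are not reproduced.

```r
set.seed(1)
m <- 3

# Stand-in for m synthetic data sets (simulated here, NOT produced by syn())
syn_list <- lapply(seq_len(m), function(i) {
  x <- rnorm(200)
  data.frame(x = x, y = 1 + 2 * x + rnorm(200))
})

# Fit the same linear model to each data set
fits <- lapply(syn_list, function(d) lm(y ~ x, data = d))

# One row per synthetic data set, one column per coefficient
mcoef <- t(sapply(fits, coef))
mvar  <- t(sapply(fits, function(f) diag(vcov(f))))

# Combined (average) estimates across the m fits
mcoefavg <- colMeans(mcoef)
mvaravg  <- colMeans(mvar)

print(mcoefavg)
print(mvaravg)
```

Averaging across the m fits reduces the extra noise introduced by the random synthesis step, which is why generating several synthetic data sets can be worthwhile.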
An object of class fit.synds. It is a list with the following components:
call |
the original call to |
mcoefavg |
combined (average) coefficient estimates. |
mvaravg |
combined (average) variance estimates of |
analyses |
|
fitting.function |
function used to fit the model. |
n |
a number of cases in the original data. |
k |
a number of cases in the synthesised data. |
proper |
a logical value indicating whether synthetic data were generated using proper synthesis. |
m |
the number of synthetic versions of the observed data. |
method |
a vector of synthesising methods applied to each variable in the saved synthesised data. |
incomplete |
a logical value indicating whether the dependent variable in the model was not synthesised. |
mcoef |
a matrix of coefficients estimates from all |
mvar |
a matrix of variance estimates from all |
glm, lm, multinom.synds, polr.synds, compare.fit.synds, summary.fit.synds
### Logit model
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital", "ls", "smoke")]
s1 <- syn(ods, m = 3)
f1 <- glm.synds(smoke ~ sex + age + edu + marital + ls, data = s1, family = "binomial")
f1
print(f1, msel = 1:2)

### Linear model
ods <- SD2011[1:1000,c("sex", "age", "income", "marital", "depress")]
ods$income[ods$income == -8] <- NA
s2 <- syn(ods, m = 3)
f2 <- lm.synds(depress ~ sex + age + log(income) + marital, data = s2)
f2
print(f2,1:3)
Graphical comparisons of a variable (var) in the synthesised data set with the original (observed) data set within subgroups defined by the variables in a vector by. var can be a factor or a continuous variable and the plots produced will depend on the class of var. The variables in by will usually be factors or variables with only a few values.
multi.compare(object, data, var = NULL, by = NULL, msel = NULL,
  barplot.position = "fill", cont.type = "hist", y.hist = "count",
  boxplot.point = TRUE, binwidth = NULL, ...)
object |
an object of class |
data |
an original (observed) data set. |
var |
variable to be compared between observed and synthetic data within subgroups. |
by |
variables to be tabulated or cross-tabulated to form groups. |
barplot.position |
type of barplot. The default |
cont.type |
default |
y.hist |
defines y scale for histograms - |
boxplot.point |
default ( |
msel |
numbers of synthetic data sets to be used - must be numbers in
the range |
binwidth |
sets width of a bin for histograms. |
... |
additional parameters that can be supplied to |
Plots as specified above. A table of the numbers in the subgroups is printed to the R console.
Numeric variables with fewer than 6 distinct values are changed to factors in order to make plots more readable.
compare.synds, compare.fit.synds
### default synthesis of selected variables
vars <- c("sex", "age", "edu", "smoke")
ods <- na.omit(SD2011[1:1000, vars])
s1 <- syn(ods)

### categorical var
multi.compare(s1, ods, var = "smoke", by = c("sex","edu"))

### numeric var
multi.compare(s1, ods, var = "age", by = c("sex"), y.hist = "density", binwidth = 5)
multi.compare(s1, ods, var = "age", by = c("sex", "edu"), cont.type = "boxplot")
Fits multinomial models to the synthesised data set(s) using the multinom function.
multinom.synds(formula, data, ...)
formula |
a symbolic description of the model to be estimated.
A typical model has the form |
data |
an object of class |
... |
additional parameters passed to |
To print the results the print function (print.fit.synds) can be used. The summary function (summary.fit.synds) can be used to obtain the combined results of models fitted to each of the m synthetic data sets.
An object of class fit.synds. It is a list with the following components:
call |
the original call to |
mcoefavg |
combined (average) coefficient estimates. |
mvaravg |
combined (average) variance estimates of |
analyses |
an object summarising the fit to each synthetic data set
or a list of |
fitting.function |
function used to fit the model. |
n |
a number of cases in the original data. |
k |
a number of cases in the synthesised data. |
proper |
a logical value indicating whether synthetic data were generated using proper synthesis. |
m |
the number of synthetic versions of the observed data. |
method |
a vector of synthesising methods applied to each variable in the saved synthesised data. |
incomplete |
a logical value indicating whether the dependent variable in the model was not synthesised. |
mcoef |
a matrix of coefficients estimates from all |
mvar |
a matrix of variance estimates from all |
multinom, glm.synds, polr.synds, print.fit.synds, summary.fit.synds, compare.fit.synds
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital", "ls", "smoke")]
s1 <- syn(ods, m = 3)
f1 <- multinom.synds(edu ~ sex + age, data = s1)
summary(f1)
print(f1, msel = 1:2)
compare(f1, ods)
Selected numeric variables are grouped into factors with ranges selected from the data.
numtocat.syn(data, numtocat = NULL, print.flag = TRUE, cont.na = NULL,
  catgroups = 5, style.groups = "quantile")
data |
a data frame. |
numtocat |
a vector of numbers or variable names of numeric variables
to be grouped into factors. If |
print.flag |
if TRUE a list of grouped variables is printed. |
cont.na |
a named list that gives the values of the named variables to be
treated as separate categories, often missing values like |
catgroups |
a single integer or a vector of integers indicating the target
number of groups for the variables in numtocat in the same order as numtocat,
or as their relative positions in data. The achieved number of groups may be
different if, for example, there are fewer than |
style.groups |
parameter of the function |
A list with the following components:
data |
a data frame with the numeric variables replaced by factors grouped into ranges. |
breaks |
a named list of the breaks used to divide each numeric variable into categories. |
levels |
a named list of the levels for the categories of each numeric variable. |
orig |
a data frame with the original numeric data. |
cont.na |
a named list of the levels for the categorical version of each numeric variable. |
numtocat |
names of the variables changed to categories. |
ind |
positions in data of the variables changed to categories. |
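The grouping idea can be sketched in base R: breaks are taken at quantiles of the regular values, and any codes listed as special (such as -8) are kept as categories of their own. This is a simplified illustration that uses `quantile()` and `cut()` directly, which is an assumption and not necessarily how the package computes its breaks; `group_numeric` and its `na_codes` argument are hypothetical names.

```r
# Hypothetical helper: quantile-based grouping with special codes kept separate
group_numeric <- function(x, catgroups = 5, na_codes = NULL) {
  special <- is.na(x) | x %in% na_codes
  probs <- seq(0, 1, length.out = catgroups + 1)
  # duplicated breaks are dropped, so fewer groups than requested may result
  brks <- unique(quantile(x[!special], probs = probs, names = FALSE))
  out <- as.character(cut(x, breaks = brks, include.lowest = TRUE, dig.lab = 6))
  out[special] <- as.character(x[special])  # NA stays NA, codes become levels
  factor(out)
}

# Toy income values, invented for illustration; -8 is a missing-data code
income <- c(-8, 500, 800, 1200, 1500, 2000, 3000, -8, NA, 900)
g <- group_numeric(income, catgroups = 2, na_codes = -8)
print(table(g, useNA = "ifany"))
```

With two target groups the regular values are split at their median, while the two -8 entries form a separate level and the true NA remains missing.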
SD2011.cat <- numtocat.syn(SD2011, cont.na = list(income = -8,
  unempdur = -8, nofriend = -8))
summary(SD2011.cat$data)
Fits ordered logistic models to the synthesised data set(s) using the polr function.
polr.synds(formula, data, ...)
formula |
a symbolic description of the model to be estimated. A typical
model has the form |
data |
an object of class |
... |
additional parameters passed to |
To print the results the print function (print.fit.synds) can be used. The summary function (summary.fit.synds) can be used to obtain the combined results of models fitted to each of the m synthetic data sets.
An object of class fit.synds. It is a list with the following components:
call |
the original call to |
mcoefavg |
combined (average) coefficient estimates. |
mvaravg |
combined (average) variance estimates of |
analyses |
an object summarising the fit to each synthetic data set
or a list of |
fitting.function |
function used to fit the model. |
n |
a number of cases in the original data. |
k |
a number of cases in the synthesised data. |
proper |
a logical value indicating whether synthetic data were generated using proper synthesis. |
m |
the number of synthetic versions of the observed data. |
method |
a vector of synthesising methods applied to each variable in the saved synthesised data. |
incomplete |
a logical value indicating whether the dependent variable in the model was not synthesised. |
mcoef |
a matrix of coefficients estimates from all |
mvar |
a matrix of variance estimates from all |
polr, glm.synds, multinom.synds, print.fit.synds, summary.fit.synds, compare.fit.synds
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital", "ls", "smoke")]
s1 <- syn(ods, m = 3)
f1 <- polr.synds(edu ~ sex + age, data = s1)
summary(f1)
print(f1, msel = 1:2)
compare(f1, ods)
Imports data sets from external files into a data frame. Currently supported files include: sav (SPSS), dta (Stata), xpt (SAS), csv (comma-separated file), tab (tab-delimited file) and txt (delimited text files). For SPSS, Stata and SAS it uses functions from the foreign package with some adjustments where necessary.
read.obs(file, convert.factors = TRUE, lab.factors = FALSE, export.lab = FALSE, ...)
file |
the name of the file (including extension) which the data are to be read from. |
convert.factors |
a logical value indicating whether variables with value labels in Stata and SPSS should be converted into R factors with those levels. |
lab.factors |
a logical value indicating whether variables with
complete value labels but imported using their numeric codes
( |
export.lab |
a logical variable indicating whether labels from SPSS or Stata should be exported to an external file. |
... |
additional parameters passed to read functions. |
A data frame with an imported data set. For SPSS, Stata and SAS it has attributes with labels.
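Choosing a reader from the file extension can be sketched as follows. This is not the package's implementation: `read_by_extension` is a hypothetical helper, the choice of base readers for csv, tab and txt files is an assumption, and the label handling and adjustments mentioned above are not reproduced. The foreign functions are only called when a matching file is supplied.

```r
# Hypothetical helper: dispatch to a reader based on the file extension
read_by_extension <- function(file, ...) {
  ext <- tolower(tools::file_ext(file))
  switch(ext,
    csv = read.csv(file, ...),
    tab = read.delim(file, ...),
    txt = read.table(file, header = TRUE, ...),
    sav = foreign::read.spss(file, to.data.frame = TRUE, ...),
    dta = foreign::read.dta(file, ...),
    xpt = foreign::read.xport(file, ...),
    stop("unsupported file type: ", ext)
  )
}

# Round trip through a temporary csv file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = c("x", "y", "z")), tmp, row.names = FALSE)
dat <- read_by_extension(tmp)
print(dat)

# An unknown extension is rejected with an error
bad <- try(read_by_extension("data.xyz"), silent = TRUE)
unlink(tmp)
```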
Determines which unique units in the synthesised data set(s) replicate unique units in the original observed data set.
replicated.uniques(object, data, exclude = NULL)
object |
an object of class |
data |
the original observed data set. |
exclude |
a single string or a vector of strings with name(s) of variable(s) to be excluded from the identification of uniques. |
A list with the following components:
replications |
a vector (for |
no.replications |
a single number or a vector of |
no.uniques |
a number of unique individuals in the original data set. |
per.replications |
a single number or a vector of |
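The check can be sketched in base R for a single synthetic data set: a synthetic record counts as a replicated unique when its combination of values occurs exactly once in the synthetic data and exactly once in the observed data. The helper names and toy data are hypothetical, and details such as the exclude argument or several synthetic copies are left out.

```r
# Hypothetical helpers (not part of synthpop)
row_keys <- function(df) do.call(paste, c(df, sep = "\r"))
only_once <- function(k) k[!(duplicated(k) | duplicated(k, fromLast = TRUE))]

replicated_uniques_sketch <- function(syn, obs) {
  k_obs <- row_keys(obs)
  k_syn <- row_keys(syn)
  uniq_obs <- only_once(k_obs)  # records unique in the observed data
  uniq_syn <- only_once(k_syn)  # records unique in the synthetic data
  replications <- k_syn %in% intersect(uniq_obs, uniq_syn)
  list(replications    = replications,
       no.replications = sum(replications),
       no.uniques      = length(uniq_obs))
}

# Toy data, invented for illustration
obs <- data.frame(sex = c("F", "F", "M", "M"), age = c(30, 30, 41, 52))
syn <- data.frame(sex = c("M", "F", "M", "M"), age = c(41, 30, 60, 60))
ru <- replicated_uniques_sketch(syn, obs)
print(ru)
```

In this toy case only the first synthetic record (M, 41) is flagged. The record (F, 30) is unique in the synthetic data but occurs twice in the observed data, so it does not single out one real individual.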
ods <- SD2011[1:1000,c("sex","age","edu","marital","smoke")]
s1 <- syn(ods, m = 2)
replicated.uniques(s1,ods)
Sample of 5,000 individuals from the Social Diagnosis 2011 survey; selected variables only.
SD2011
A data frame with 5,000 observations on the following 35 variables:
Sex
Age of person, 2011
Age group, 2011
Category of the place of residence
Region (voivodeship)
Highest educational qualification, 2011
Discipline of completed qualification
Socio-economic status, 2011
Total duration of unemployment in the last 2 years (in months)
Personal monthly net income
Marital status
Month of marriage
Year of marriage
Month of separation/divorce
Year of separation/divorce
Perception of life as a whole
Depression symptoms indicator
View on interpersonal trust
Trust in own family members
Trust in neighbours
Active engagement in some form of sport or exercise
Number of friends
Smoking cigarettes
Number of cigarettes smoked per day
Drinking too much alcohol
Starting to use alcohol to cope with troubles
Working abroad in 2007-2011
Total time spent on working abroad
Plans to go abroad to work in the next two years
Intended duration of working abroad
Intended destination country
Knowledge of English language
Height of person
Weight of person
Body mass index
Please note that the original variable names have been changed to make them more self-explanatory. Some variable labels have been adjusted as well.
Council for Social Monitoring. Social Diagnosis 2000-2011: integrated database. http://www.diagnoza.com/index-en.html [downloaded on 13/12/2013]
Czapinski J. and Panek T. (Eds.) (2011). Social Diagnosis 2011. Objective and Subjective Quality of Life in Poland - full report. Contemporary Economics, Volume 5, Issue 3 (special issue) http://ce.vizja.pl/en/issues/volume/5/issue/3#art254
spineplot(englang ~ agegr, data = SD2011, xlab = "Age group", ylab = "Knowledge of English")
boxplot(income ~ sex, data = SD2011[SD2011$income != -8,])
Labeling and removing unique replicates of unique actual (observed) individuals.
sdc(object, data, label = NULL, rm.replicated.uniques = FALSE,
  uniques.exclude = NULL, recode.vars = NULL, bottom.top.coding = NULL,
  recode.exclude = NULL, smooth.vars = NULL)
object |
an object of class |
data |
the original (observed) data set. |
label |
a single string with a label to be added to the synthetic data sets as a new variable to make it clear that the data are synthetic/fake. |
rm.replicated.uniques |
a logical value indicating whether unique replicates of units that are unique also in the original data set should be removed. |
uniques.exclude |
a single string or a vector of strings with name(s) of variable(s) to be excluded from the identification of uniques. |
recode.vars |
a single string or a vector of strings with name(s) of variable(s) to be bottom- and/or top-coded. |
bottom.top.coding |
a list of two-element vectors specifying
bottom and top codes for each variable in |
recode.exclude |
a list specifying for each variable in
|
smooth.vars |
a single string or a vector of strings with name(s)
of numeric variable(s) to be smoothed ( |
An object provided as an argument adjusted in accordance with the other parameters' values.
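Bottom- and top-coding of a single numeric variable can be sketched in base R. The helper below is hypothetical and only illustrates the idea: values below the bottom code or above the top code are replaced by that code, an NA code means no recoding on that side, and values listed in `exclude` (for example a missing-data code such as -8) are left untouched.

```r
# Hypothetical helper (not part of synthpop): recode extreme values
bottom_top_code <- function(x, bottom = NA, top = NA, exclude = NA) {
  # only non-missing values that are not excluded codes may be recoded
  keep <- !is.na(x) & !(x %in% exclude[!is.na(exclude)])
  if (!is.na(bottom)) x[keep & x < bottom] <- bottom
  if (!is.na(top))    x[keep & x > top]    <- top
  x
}

# Toy values, invented for illustration
age <- c(16, 25, 93, NA)
age_coded <- bottom_top_code(age, bottom = 20, top = 80)

income <- c(-8, 50, 500, 3500)
income_coded <- bottom_top_code(income, bottom = 100, top = 2000, exclude = -8)

print(age_coded)
print(income_coded)
```

Excluding the -8 code matters here: without it, the bottom code of 100 would overwrite the missing-data marker and turn it into what looks like a real income.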
ods <- SD2011[1:1000,c("sex","age","edu","marital","income")]
s1 <- syn(ods, m = 2)
s1.sdc <- sdc(s1, ods, label="false_data", rm.replicated.uniques = TRUE,
  recode.vars = c("age","income"),
  bottom.top.coding = list(c(20,80),c(NA,2000)),
  recode.exclude = list(NA,c(NA,-8)))
Combines the results of models fitted to each of the m synthetic data sets.
## S3 method for class 'fit.synds'
summary(object, population.inference = FALSE, msel = NULL,
  real.varcov = NULL, incomplete = NULL, ...)

## S3 method for class 'summary.fit.synds'
print(x, ...)
object |
an object of class |
population.inference |
a logical value indicating whether inference
should be made to population quantities. If |
msel |
index or indices of the synthetic datasets ( |
real.varcov |
the estimated variance-covariance matrix of the fit of the
model to the original data. This parameter is used in the function
|
incomplete |
Logical variable as to whether population inference for
incomplete synthesis is to be used. If this is left at a |
... |
additional parameters. |
x |
an object of class |
The mean of the estimates from each of the m synthetic data sets yields asymptotically unbiased estimates of the coefficients if the observed data conform to the distribution used for synthesis. The standard errors are estimated differently depending on whether inference is made for the results that we would expect to obtain from the observed data or for the parameters of the population that we assume the observed data are sampled from. The standard errors also differ according to whether synthetic data were produced using simple or proper synthesis (for details see Raab et al. (2017)).
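As a minimal sketch of the point estimate described above, the combined coefficient is simply the mean over the m fits. The numbers below are invented for illustration and the object names (est, pooled) are assumptions; the standard error calculations, which depend on the type of inference and synthesis, are deliberately not reproduced here.

```r
# Hypothetical coefficient estimates from m = 3 synthetic data sets
# (rows = synthetic copies, columns = model coefficients); values are made up.
est <- rbind(c(intercept = 1.02, age = 0.48),
             c(intercept = 0.95, age = 0.52),
             c(intercept = 1.06, age = 0.50))

m      <- nrow(est)
pooled <- colMeans(est)   # mean of the m estimates, one value per coefficient
```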
An object of class summary.fit.synds, which is a list with the following components:
call |
the original call to |
proper |
a logical value indicating whether synthetic data were generated using proper synthesis. |
population.inference |
a logical value indicating whether inference is made to population coefficients or to the results that would be expected from an analysis of the original data (see above). |
incomplete |
a logical value indicating whether the dependent variable
in the model was not synthesised. It is derived in the synthpop
implementation of the fitting functions ( |
fitting.function |
function used to fit the model. |
m |
the number of synthetic versions of the original (observed) data. |
coefficients |
a matrix with combined estimates. If inference is
required to the results that would be obtained from an analysis of the
original data, ( |
n |
the number of cases in the original data. |
k |
the number of cases in the synthesised data. Note that if |
analyses |
|
msel |
index or indices of synthetic data copies for which summaries
of fitted models are produced. If |
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Raab, G.M., Nowok, B. and Dibben, C. (2017). Practical data synthesis for large samples. Journal of Privacy and Confidentiality, 7(3), 67-97. Available at: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407
Reiter, J.P. (2003). Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, 181-188.
compare.fit.synds, summary, print
ods <- SD2011[1:1000, c("sex","age","edu","ls","smoke")]

### simple synthesis
s1 <- syn(ods, m = 5)
f1 <- glm.synds(smoke ~ sex + age + edu + ls, data = s1, family = "binomial")
summary(f1)
summary(f1, population.inference = TRUE)

### proper synthesis
s2 <- syn(ods, m = 5, method = "parametric", proper = TRUE)
f2 <- glm.synds(smoke ~ sex + age + edu + ls, data = s2, family = "binomial")
summary(f2)
summary(f2, population.inference = TRUE)
Produces summaries of the synthesised variables. When more than one synthetic data set has been generated (object$m > 1), by default summaries are calculated by averaging summary values for all synthetic data copies (see msel argument).
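The averaging of summary values across synthetic copies can be pictured with the base R sketch below. It only illustrates the idea for a single numeric variable and is not the code of summary.synds; the simulated copies and the names syn_copies, per_copy and averaged are assumptions.

```r
# Three made-up synthetic copies of a numeric variable (illustrative only).
set.seed(42)
syn_copies <- replicate(3, data.frame(age = round(runif(100, 18, 90))),
                        simplify = FALSE)

# Six-number summary for each copy (one column per copy),
# then the element-wise average over the copies.
per_copy <- sapply(syn_copies, function(d) summary(d$age))
averaged <- rowMeans(per_copy)
```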
## S3 method for class 'synds'
summary(object, msel = NULL, maxsum = 7,
        digits = max(3, getOption("digits")-3), ...)

## S3 method for class 'summary.synds'
print(x, ...)
object |
an object of class |
msel |
index or indices of synthetic data copies for which a summary
is desired. If |
maxsum |
integer, indicating how many levels should be shown for factors. |
digits |
integer, used for number formatting with |
... |
additional arguments passed to |
x |
an object of class |
See summary for more details.
An object of class summary.synds, which is a list with the following components:
m |
the number of synthetic versions of the original (observed) data. |
msel |
index or indices of synthetic data copies for which a summary
is produced. If |
method |
a vector of synthesising methods applied to each variable in the saved synthesised data. |
result |
a table or a list of tables (if more than one synthetic data set is selected) with summaries of synthesised variables. |
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
s1 <- syn(SD2011[, c("sex","age","edu","marital")], m = 3)
summary(s1)
summary(s1, msel = c(1,3))
Generates synthetic version(s) of a data set. Function syn.strata() performs stratified synthesis.
syn(data, method = "cart", visit.sequence = (1:ncol(data)),
    predictor.matrix = NULL,
    m = 1, k = nrow(data), proper = FALSE,
    minnumlevels = 1, maxfaclevels = 60,
    rules = NULL, rvalues = NULL,
    cont.na = NULL, semicont = NULL,
    smoothing = NULL, event = NULL, denom = NULL,
    drop.not.used = FALSE, drop.pred.only = FALSE,
    default.method = c("normrank", "logreg", "polyreg", "polr"),
    numtocat = NULL, catgroups = rep(5, length(numtocat)),
    models = FALSE, print.flag = TRUE, seed = "sample", ...)

syn.strata(data, strata = NULL,
    minstratumsize = 10 + 10 * length(visit.sequence),
    tab.strataobs = TRUE, tab.stratasyn = FALSE,
    method = "cart", visit.sequence = (1:ncol(data)),
    predictor.matrix = NULL,
    m = 1, k = nrow(data), proper = FALSE,
    minnumlevels = 1, maxfaclevels = 60,
    rules = NULL, rvalues = NULL,
    cont.na = NULL, semicont = NULL,
    smoothing = NULL, event = NULL, denom = NULL,
    drop.not.used = FALSE, drop.pred.only = FALSE,
    default.method = c("normrank", "logreg", "polyreg", "polr"),
    numtocat = NULL, catgroups = rep(5, length(numtocat)),
    models = FALSE, print.flag = TRUE, seed = "sample", ...)

## S3 method for class 'synds'
print(x, ...)
data |
a data frame or a matrix ( |
method |
a single string or a vector of strings of length
|
visit.sequence |
a character vector of names of variables or an integer
vector of their column indices specifying the order of synthesis.
The default sequence |
predictor.matrix |
a square matrix of size |
m |
number of synthetic copies of the original (observed) data to be
generated. The default is |
k |
the size of the synthetic data set ( |
proper |
a logical value with default set to |
minnumlevels |
a minimum number of values a numeric variable should exceed
to be treated as numeric during the synthesis. Numeric variables with only
|
maxfaclevels |
a maximum number of factor levels that can be handled. It can be increased to allow the synthesis to run but too large a value may cause computational problems, especially for parametric methods. |
rules |
a named list of rules for restricted values. Restricted values are those that are determined explicitly by values of other variables. The names of the list elements must correspond to the variables names for which the rules need to be specified. |
rvalues |
a named list of the values corresponding to the rules
specified by |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
semicont |
a named list of values at which semi-continuous variables have spikes. The names of the list elements must correspond to the names of the semi-continuous variables. |
smoothing |
a single string specifying a smoothing method for all numeric
variables in the data or a named list specifying a smoothing method to be
used for selected variables. Available methods include: |
event |
a named list specifying for survival data the names of corresponding event indicators. The names of the list elements must correspond to the names of the survival variables. |
denom |
a named list specifying for variables to be modelled using binomial regression the names of corresponding denominator variables. The names of the list elements must correspond to the names of the variables to be modelled using binomial regression. |
drop.not.used |
a logical value. If |
drop.pred.only |
a logical value. If |
default.method |
a vector of four strings containing the default
parametric synthesising methods for numerical variables, factors
with two levels, unordered factors with more than two levels
and ordered factors with more than two levels respectively.
They are used when |
numtocat |
a vector of numbers or names to indicate columns of |
catgroups |
An integer or a vector of integers of the same length as
|
models |
if |
print.flag |
if |
seed |
an integer to be used as an argument for the |
... |
additional arguments to be passed to synthesising functions. See section 'Details' below for more information. |
strata |
a numeric vector with strata identifiers or a string vector with names of stratifying variable(s). |
minstratumsize |
minimum size of each stratum. |
tab.strataobs |
a logical value indicating whether a frequency table of the number of observations in strata in the original data set should be printed. |
tab.stratasyn |
a logical value indicating whether a frequency table of the number of observations in strata in the synthetic data set(s) should be printed. |
x |
an object of class |
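The default.method argument described above lists four parametric methods in a fixed order (numerical variables, two-level factors, unordered factors with more than two levels, ordered factors with more than two levels). The sketch below shows how such a mapping could be applied to the columns of a data frame. The helper pick_default_method and the toy data frame are hypothetical and are not part of synthpop; the package's own selection logic may differ in details.

```r
# Hypothetical helper (not a synthpop function): pick the default parametric
# method for one variable, following the order of `default.method`.
pick_default_method <- function(v,
    default.method = c("normrank", "logreg", "polyreg", "polr")) {
  if (is.numeric(v))   return(default.method[1])  # numerical variable
  v <- as.factor(v)
  if (nlevels(v) == 2) return(default.method[2])  # factor with two levels
  if (is.ordered(v))   return(default.method[4])  # ordered, more than two levels
  default.method[3]                               # unordered, more than two levels
}

# Toy data frame, invented for the example.
df <- data.frame(
  income = c(1200, 800, 1500, 950),
  smoke  = factor(c("YES", "NO", "NO", "YES")),
  region = factor(c("N", "S", "E", "W")),
  edu    = factor(c("low", "mid", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE))

methods <- vapply(df, pick_default_method, character(1))
```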
Only variables that are in visit.sequence with corresponding non-empty method are synthesised. The only exceptions are event indicators. They are synthesised along with the corresponding time to event variables and should not be included in visit.sequence. All other variables (not in visit.sequence or in visit.sequence with a corresponding blank method) can be used as predictors. Including them in visit.sequence generates a default predictor.matrix reflecting the order of variables in the visit.sequence; otherwise predictor.matrix has to be adjusted accordingly. All predictors of the variables that are not in visit.sequence or are in visit.sequence but with a blank method are removed from predictor.matrix.
Variables to be synthesised that are not synthesised yet cannot be used as predictors. Also all variables used in passive synthesis or in restricted values rules (rules) have to be synthesised before the variables they apply to.
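The relationship between the visit sequence and the default predictor matrix can be sketched in base R as follows. This is an assumption-laden illustration rather than synthpop's internal code: the helper default_pred_matrix is hypothetical, rows are taken to be the variables being predicted and columns the candidate predictors, and each variable is predicted by all variables visited before it.

```r
# Hypothetical helper (not synthpop's internal code). An entry of 1 in
# row r, column c means "variable c is used as a predictor of variable r".
default_pred_matrix <- function(vars, visit.sequence) {
  pm <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in seq_along(visit.sequence)[-1]) {
    # the i-th visited variable is predicted by everything visited before it
    pm[visit.sequence[i], visit.sequence[seq_len(i - 1)]] <- 1
  }
  pm
}

vars <- c("sex", "age", "marital", "income")
pm <- default_pred_matrix(vars, visit.sequence = c("age", "sex", "income"))
```

In this sketch marital is not in the visit sequence, so its row and column stay at zero; using it as a predictor would require setting the relevant entries by hand.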
A mismatch between data type and synthesising method stops execution and prints an error message, but numeric variables with a number of levels less than minnumlevels are changed into factors and methods are changed automatically, if necessary, to methods for categorical variables. Methods for variables not in a visit sequence will be changed into blank.
The built-in elementary synthesising methods defined by conditional distributions include:
classification and regression trees (CART), see syn.cart
methods using ensembles of CART trees, see syn.bag, syn.rf, and syn.ranger
classification and regression trees (CART) for duration time data (parametric methods for survival data are not implemented yet), see syn.survctree
normal linear regression, see syn.norm
normal linear regression preserving the marginal distribution, see syn.normrank
normal linear regression after natural logarithmic, square root and cube root transformation of a dependent variable respectively, see syn.lognorm
logistic regression, see syn.logreg
unordered polytomous regression, see syn.polyreg
ordered polytomous regression, see syn.polr
predictive mean matching, see syn.pmm
random sample from the observed data, see syn.sample
function of other synthesised data, see syn.passive
bootstrap sample within each category of the original grouping variable, see syn.nested
bootstrap sample within each category of the crosstabulation of all the predictor variables, see syn.satcat
These methods use a group of variables that are synthesised together. They must always be together at the start of the visit sequence:
fit a saturated log-linear model, see syn.catall
fit a log-linear model, defined by its margins, by iterative proportional fitting, see syn.ipf
The functions corresponding to these methods are called syn.method, where method is a string with the name of a synthesising method. For instance, the function corresponding to the ctree method is called syn.ctree. A new synthesising method can be introduced by writing a function named syn.newmethod and then specifying the method parameter of the syn() function as "newmethod".
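A user-defined method could look like the sketch below. The name syn.mysample is hypothetical, and the argument list (y, x, xp, ...) and the returned list with components res and fit are modelled on the built-in univariate methods documented on this page; treat it as a starting point under those assumptions rather than a definitive template.

```r
# Hypothetical user-defined method "mysample" (illustrative, not shipped with
# synthpop). It ignores the predictors and resamples the observed values.
syn.mysample <- function(y, x, xp, ...) {
  idx <- sample.int(length(y), size = nrow(xp), replace = TRUE)
  list(res = y[idx],                      # one synthetic value per row of xp
       fit = "random resampling of y")    # placeholder for a fitted model
}

# Direct call on toy inputs to check the shape of the result.
set.seed(1)
y  <- c(10, 20, 30, 40)
x  <- matrix(rnorm(8), nrow = 4)
xp <- matrix(rnorm(12), nrow = 6)
out <- syn.mysample(y, x, xp)
```

Following the naming rule above, such a function would be requested in syn() by setting method to "mysample".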
In order to use "nested" sampling, the method parameter of the syn function has to be specified as "nested.varname", where "varname" is the name of the grouped (less detailed) variable, the only one used in nested synthesis. A variable synthesised using the "nested" method is excluded from synthesising other variables except when used for the "nested" method.
Additional parameters can be passed to synthesising methods as part of the dots argument. They have to be named using period-separated method and parameter name (method.parameter). For instance, in order to set a minbucket (minimum number of observations in any terminal node of a CART model) for a ctree synthesising method, ctree.minbucket has to be specified. The parameters are method-specific and will be used for all variables to be synthesised using that method. See help for syn.method for further details about the allowed parameters for a specific method.
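The method.parameter naming convention can be illustrated with a few lines of base R that pick out the arguments belonging to one method. This is only a sketch of the convention and not synthpop's internal code; the helper args_for and the example values are assumptions.

```r
# Illustrative only: arguments named "method.parameter" as they might arrive
# through the dots argument.
dots <- list(ctree.minbucket = 10, cart.cp = 0.001, ctree.mincriterion = 0.95)

# Hypothetical helper: keep the arguments for one method and strip the prefix.
args_for <- function(dots, method) {
  prefix <- paste0(method, ".")
  out <- dots[startsWith(names(dots), prefix)]
  names(out) <- substring(names(out), nchar(prefix) + 1)
  out
}

ctree_args <- args_for(dots, "ctree")
```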
The summary function (summary.synds) can be used to obtain a summary of the synthesised variables.
An object of class synds, which stands for 'synthesised data set'. It is a list with the following components:
call |
an original call to |
m |
number of synthetic versions of the original (observed) data. |
syn |
a data frame (for |
method |
a vector of synthesising methods applied to each variable in the saved synthesised data. |
visit.sequence |
a vector of column indices of the visiting sequence. The indices refer to the columns in the saved synthesised data. |
predictor.matrix |
a matrix specifying the set of predictors used for each variable in the saved synthesised data. |
smoothing |
a vector specifying smoothing methods applied to each variable in the saved synthesised data. |
event |
a vector of integers specifying for survival data the column indices for corresponding event indicators. The indices refer to the columns in the saved synthesised data. |
denom |
a vector of integers specifying for variables modelled using binomial regression the column indices for corresponding denominator variables. The indices refer to the columns in the saved synthesised data. |
proper |
a logical value indicating whether proper synthesis was conducted. |
n |
the number of cases in the original data. |
k |
the number of cases in the synthesised data. |
rules |
a list of rules for restricted values applied to the synthetic data. |
rvalues |
a list of the values corresponding to the rules
specified by |
cont.na |
a list of codes for missing values for continuous variables. |
semicont |
a list of values for semi-continuous variables at which they have spikes. |
drop.not.used |
a logical value indicating whether variables not used in synthesis are saved in the synthesised data and corresponding synthesis parameters. |
drop.pred.only |
a logical value indicating whether variables not synthesised and used as predictors only are saved in the synthesised data. |
models |
if |
seed |
an integer used as a |
var.lab |
a vector of variable labels for data imported from SPSS using
|
val.lab |
a list of value labels for factors for data imported from SPSS
using |
obs.vars |
a vector of all variable names in the observed data set. |
When syn.strata() is used there are two additional components:
strata.syn |
a factor variable or a list of factor variables containing
stratum values for all observation units in |
strata.lab |
a character vector of strata labels. |
Note also that when syn.strata is used most values of the items are matrices with each row corresponding to a stratum or lists with one element per stratum.
See package vignette for additional information.
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
### selection of variables
vars <- c("sex","age","marital","income","ls","smoke")
ods <- SD2011[1:1000, vars]

### default synthesis
s1 <- syn(ods)
s1

### synthesis with default parametric methods
s2 <- syn(ods, method = "parametric", seed = 123)
s2$method

### multiple synthesis of selected variables with customised methods
s3 <- syn(ods, visit.sequence = c(2, 1, 4, 5), m = 2,
          method = c("logreg","sample","","normrank","ctree",""),
          ctree.minbucket = 10)
summary(s3)
summary(s3, msel = 1:2)

### adjustment to the default predictor matrix
s4.ini <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
              m = 0, drop.not.used = FALSE)
pM.cor <- s4.ini$predictor.matrix
pM.cor["marital","ls"] <- 0
s4 <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
          predictor.matrix = pM.cor)

### handling missing values in continuous variables
s5 <- syn(ods, cont.na = list(income = c(NA, -8)))

### rules for restricted values - marital status of males under 18 should be 'single'
s6 <- syn(ods, rules = list(marital = "age < 18 & sex == 'MALE'"),
          rvalues = list(marital = 'SINGLE'),
          method = "parametric", seed = 123)
with(s6$syn, table(marital[age < 18 & sex == 'MALE']))

### results for default parametric synthesis without the rule
with(s2$syn, table(marital[age < 18 & sex == 'MALE']))

### synthesis with ipf for all variables
s7 <- syn(ods[, 1:3], method = "ipf", numtocat = "age")

### alternatively group the numeric variable before synthesis to save
### the grouped data rather than the numeric in the synthetic data set
ods.cat <- numtocat.syn(ods, numtocat = "age", catgroups = 10)$data
s8 <- syn(ods.cat[, 1:3], method = "ipf")

### stratified synthesis
s9 <- syn.strata(ods, strata = "sex")
Generates univariate synthetic data using bagging. It uses the randomForest function from the randomForest package with the number of sampled predictors equal to the number of all predictors.
syn.bag(y, x, xp, smoothing = "", proper = FALSE, ntree = 10, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
smoothing |
smoothing method for numeric variable. See
|
proper |
for proper synthesis ( |
ntree |
number of trees to grow. |
... |
additional parameters passed to
|
...
A list with two components:
res |
a vector of length |
fit |
the model fitted to the observed data that was used to produce synthetic values. |
...
syn, syn.rf, syn.cart, randomForest, syn.smooth
A saturated model is fitted to a table produced by cross-tabulating all the variables.
syn.catall(x, k, proper = FALSE, priorn = 1, structzero = NULL, maxtable = 1e8, epsilon = 0, rand = TRUE, ...)
x |
a data frame ( |
k |
a number of rows in each synthetic data set - defaults to |
proper |
if |
priorn |
the sum of the parameters of the Dirichlet prior which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters. |
structzero |
a named list of lists that defines which cells in the table
are structural zeros and will remain as zeros in the synthetic data, by
leaving their prior as zeros. Each element of the |
maxtable |
a number of cells in the cross-tabulation of all the variables that will trigger a severe warning. |
epsilon |
measures the scale of the Laplace noise to be added under differential privacy (DP). |
rand |
for DP versions determines if multinomial noise is to be added to DP counts. If it is set to false the DP adjusted counts are simply rounded to a whole number in a manner that preserves the desired sample size (k). |
... |
additional parameters. |
When used in the syn function, the group of categorical variables with method = "catall" must all be together at the start of the visit.sequence. Subsequent variables in visit.sequence are then synthesised conditional on the synthesised values of the grouped variables.
A saturated model is fitted to a table produced by cross-tabulating all the variables. Prior probabilities for the proportions in each cell of the table are specified from the parameters of a Dirichlet distribution with the same parameter for every cell in the table that is not a structural zero (see above). The sum of these parameters is priorn so that each one is $\mathrm{priorn}/N$, where $N$ is the number of cells in the table that are not structural zeros. The default priorn = 1 can be thought of as equivalent to the knowledge that 1 observation would be equally likely to be in any cell that is not a structural zero. The posterior expectation, given the observed counts, for the probability of being in a cell with observed count $N_i$ is thus $(N_i + \mathrm{priorn}/N) / (\sum_j N_j + \mathrm{priorn})$. The synthetic data are generated from a multinomial distribution with parameters given by these probabilities.
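The calculation described in this paragraph can be sketched in base R. The example assumes a symmetric Dirichlet prior whose parameters sum to priorn, no structural zeros, and a small invented cross-tabulation; the object names (obs, post, syn_counts) are assumptions and this is not the code of syn.catall.

```r
set.seed(2024)

# Made-up 2 x 3 cross-tabulation of two categorical variables.
obs <- matrix(c(12, 0, 7, 5, 3, 9), nrow = 2,
              dimnames = list(sex = c("F", "M"),
                              edu = c("low", "mid", "high")))

priorn <- 1
N <- length(obs)   # cells that are not structural zeros (all six here)
n <- sum(obs)      # number of observed cases
k <- n             # size of the synthetic data set

# Posterior expectation of each cell probability: the prior adds priorn / N
# pseudo-observations to every cell, so empty cells get a small positive share.
post <- (obs + priorn / N) / (n + priorn)

# Synthetic cell counts drawn from a multinomial with these probabilities.
syn_counts <- obs
syn_counts[] <- rmultinom(1, size = k, prob = as.vector(post))
```

Because the empty cell receives a positive posterior probability, it can appear in the synthetic table, which is the behaviour that structural zeros are meant to prevent where such a combination is impossible.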
Unlike syn.satcat, which fits saturated conditional models, the synthesised data can include any combination of variables, except those defined by the combinations of variables in structzero.
NOTE that when the function is called by setting elements of method in syn() to "catall", the parameters priorn, structzero, maxtable, epsilon, and rand must be supplied to syn as e.g. catall.priorn.
A list with two components:
res |
a data frame of dimension |
fit |
the cross-tabulation of all the original variables used. |
ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])

# Each `placesize_region` sublist:
# for each relevant level of `placesize` defined in the first element,
# the second element defines regions (variable `region`) that do not
# have places of that size.
struct.zero <- list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER",
                          region = c(2, 4, 5, 8:13, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000",
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000",
                          region = c(1, 3, 5, 6, 8, 9, 14:15)))

syncatall <- syn(ods, method = c(rep("catall", 4), "ctree", "normrank", "ctree"),
                 catall.priorn = 2, catall.structzero = struct.zero)
Generates univariate synthetic data using classification and regression trees (without or with bootstrap).
syn.ctree(y, x, xp, smoothing = "", proper = FALSE,
          minbucket = 5, mincriterion = 0.9, ...)

syn.cart(y, x, xp, smoothing = "", proper = FALSE,
         minbucket = 5, cp = 1e-08, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
smoothing |
smoothing method for numeric variable. See
|
proper |
for proper synthesis ( |
minbucket |
the minimum number of observations in
any terminal node. See |
cp |
complexity parameter. Any split that does not
decrease the overall lack of fit by a factor of cp is not
attempted. Small values of |
mincriterion |
|
... |
additional parameters passed to
|
The procedure for synthesis by a CART model is as follows:
Fit a classification or regression tree by binary recursive partitioning.
For each xp find the terminal node.
Randomly draw a donor from the members of the node and take the observed value of y from that draw as the synthetic value.
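The three steps above can be sketched with rpart directly. This is a minimal illustration for a numeric y and not the code of syn.cart: the simulated data, the object names (dat, fit_nodes, syn_y) and the device of relabelling fit$frame$yval to recover node membership for new rows are assumptions made for the example.

```r
library(rpart)
set.seed(1)

# Simulated "observed" data and the predictor values to synthesise for.
x  <- data.frame(age = runif(200, 18, 80))
y  <- 1000 + 20 * x$age + rnorm(200, sd = 100)
xp <- data.frame(age = runif(50, 18, 80))

# Step 1: fit a regression tree by binary recursive partitioning.
dat <- data.frame(y = y, x)
fit <- rpart(y ~ ., data = dat, method = "anova",
             control = rpart.control(minbucket = 5, cp = 1e-08))

# Step 2: find the terminal node of every xp row. Replacing the node means
# with the row index of fit$frame makes predict() return node identifiers
# that are comparable with fit$where (the node of each observed case).
fit_nodes <- fit
fit_nodes$frame$yval <- seq_len(nrow(fit$frame))
node_obs <- fit$where
node_new <- predict(fit_nodes, newdata = xp)

# Step 3: draw one donor at random from each node and copy its observed y.
syn_y <- vapply(node_new, function(nd) {
  donors <- y[node_obs == nd]
  donors[sample.int(length(donors), 1)]
}, numeric(1))
```

Every synthetic value is an observed value of y, which is why a larger minbucket (more potential donors per node) gives more protection.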
syn.ctree uses the ctree function from the party package and syn.cart uses the rpart function from the rpart package. They differ, among other things, in the selection of a splitting variable and the stopping rule for the splitting process.
Gaussian kernel smoothing can be applied to continuous variables by setting the smoothing parameter to "density". It is recommended as a tool to decrease the disclosure risk. Increasing minbucket is another means of data protection.
CART models were suggested for generation of synthetic data by Reiter (2005) and then evaluated by Drechsler and Reiter (2011).
A list with two components:
res |
a vector of length |
fit |
the fitted model which is an object of class |
Reiter, J.P. (2005). Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics, 21(3), 441–462.
Drechsler, J. and Reiter, J.P. (2011). An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics and Data Analysis, 55(12), 3232–3243.
syn, syn.survctree, rpart, ctree, syn.smooth
A fit to the table is obtained from the log-linear fit that matches the numbers in the margins specified by the margin parameters.
syn.ipf(x, k, proper = FALSE, priorn = 1, structzero = NULL,
        gmargins = "twoway", othmargins = NULL, tol = 1e-3,
        max.its = 5000, maxtable = 1e8, print.its = FALSE,
        epsilon = 0, rand = TRUE, ...)
x |
a data frame of the set of original data to be synthesised. |
k |
a number of rows in each synthetic data set - defaults to |
proper |
if |
priorn |
the sum of the parameters of the Dirichlet prior which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters. |
structzero |
a named list of lists that defines which cells in the table
are structural zeros and will remain as zeros in the synthetic data, by
leaving their prior as zeros. Each element of the |
gmargins |
a single character string to define a group of margins. At present there are "oneway" and "twoway" options that create, respectively, all 1-way and 2-way margins from the table. |
othmargins |
a list of margins that will be fitted. If |
tol |
stopping criterion for |
max.its |
maximum number of iterations allowed for |
maxtable |
the number of cells in the cross-tabulation of all the variables that will trigger a severe warning. |
print.its |
if true the iterations from |
epsilon |
epsilon value for overall differential privacy (DP) parameter. This is implemented by dividing the privacy budget equally over all the margins used to fit the data. |
rand |
when epsilon is > 0 and DP synthetic data are created this determines whether the data are created by Poisson counts from the expected fitted counts in the cells of the DP adjusted data. |
... |
additional parameters. |
When used in the syn function, the group of variables with method = "ipf" must all be together at the start of the visit sequence. This function is designed for categorical variables, but it can also be used for numerical variables if they are categorised by specifying them in the numtocat parameter of the main function syn. Subsequent variables in visit.sequence are then synthesised conditional on the synthesised values of the grouped variables. A fit to the table is obtained from the log-linear fit that matches the numbers in the margins specified by the margin parameters. Prior probabilities for the proportions in each cell of the table are given by a Dirichlet distribution with the same parameter for every cell in the table that is not a structural zero. The sum of these parameters is priorn. The default priorn = 1 can be thought of as equivalent to the knowledge that 1 observation would be equally likely to fall in any cell of the table. The synthetic data are generated from a multinomial distribution with parameters given by the expected posterior probabilities for each cell of the table. If the maximum likelihood estimate of the probability from the log-linear fit to cell $i$ is $p_i$, the original data have $n$ observations and the table has $N$ cells that are not structural zeros, then the expectation of the posterior probability for this cell is $(n p_i + \mathrm{priorn}/N) / (n + \mathrm{priorn})$ or equivalently $(p_i + \mathrm{priorn}/(nN)) / (1 + \mathrm{priorn}/n)$.
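The fitting and sampling steps described above can be sketched with the base R function loglin, which fits a log-linear model to specified margins by iterative proportional fitting. The example uses the built-in HairEyeColor table with all two-way margins and assumes no structural zeros; the way the prior is combined with the fitted probabilities follows the Dirichlet description in the text and is an assumption, as are the object names. It is not the code of syn.ipf.

```r
set.seed(7)

tab     <- HairEyeColor                       # built-in 3-way table (Hair, Eye, Sex)
margins <- list(c(1, 2), c(1, 3), c(2, 3))    # all two-way margins

# stats::loglin fits the log-linear model by iterative proportional fitting.
ipf <- loglin(tab, margin = margins, fit = TRUE, print = FALSE)

priorn <- 1
N <- length(tab)                  # number of cells, no structural zeros assumed
n <- sum(tab)                     # number of observed cases
p_hat <- ipf$fit / sum(ipf$fit)   # fitted cell probabilities

# Expected posterior probability per cell under a symmetric Dirichlet prior
# whose parameters sum to priorn.
post <- (n * p_hat + priorn / N) / (n + priorn)

# Synthetic table of size k drawn from a multinomial with these probabilities.
k <- n
syn_tab <- tab
syn_tab[] <- rmultinom(1, size = k, prob = as.vector(post))
```

Only the two-way margins are matched by the fit, so the three-way interaction in the synthetic table reflects the model rather than the observed data.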
Unlike syn.satcat, which fits saturated models from their conditional distributions, x can include any combination of variables, including those not present in the original data, except those defined by structzero.
NOTE that when the function is called by setting elements of method in syn to "ipf", the parameters priorn, structzero, gmargins, othmargins, tol, max.its, maxtable, print.its, epsilon, and rand must be supplied to syn as e.g. ipf.priorn.
A list with two components:
res |
a data frame with |
fit |
a list made up of two lists: the margins fitted and the original data for each margin. |
ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])

# Each `placesize_region` sublist:
# for each relevant level of `placesize` defined in the first element,
# the second element defines regions (variable `region`) that do not
# have places of that size.
struct.zero <- list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER",
                          region = c(2, 4, 5, 8:13, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000",
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000",
                          region = c(1, 3, 5, 6, 8, 9, 14:15)))

synipf <- syn(ods, method = c(rep("ipf", 4), "ctree", "normrank", "ctree"),
              ipf.gmargins = "twoway", ipf.othmargins = list(c(1, 2, 3)),
              ipf.priorn = 2, ipf.structzero = struct.zero)
Generates univariate synthetic data using linear regression
of an outcome variable transformed by natural logarithm (lognorm),
square root (sqrtnorm) or cube root (cubertnorm).
syn.lognorm(y, x, xp, proper = FALSE, ...)
syn.sqrtnorm(y, x, xp, proper = FALSE, ...)
syn.cubertnorm(y, x, xp, proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
a logical value specifying whether proper synthesis should be conducted. See details. |
... |
additional parameters. |
Generates synthetic values using the spread around the
fitted linear regression line of transformed y given x.
For proper synthesis first the regression coefficients are drawn
from a normal distribution with mean and variance from the fitted model.
The synthetic values are transformed back to the original scale.
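As a rough illustration of the description above, the following base R sketch fits a regression to log(y), adds normal noise with the residual standard deviation to the predictions for the synthesised covariates, and back-transforms with exp(). It is not synthpop's implementation: the simulated data and all object names (x, xp, y_syn and so on) are assumptions made for the example, and only the non-proper case is shown.

```r
# Sketch only: simulated data, hypothetical object names.
set.seed(1)
n  <- 200
x  <- cbind(age = runif(n, 20, 60))                     # original covariates
y  <- exp(1 + 0.03 * x[, "age"] + rnorm(n, sd = 0.4))   # positive outcome
xp <- cbind(age = runif(n, 20, 60))                     # synthesised covariates

fit   <- lm(log(y) ~ x)            # regression on the transformed scale
beta  <- coef(fit)
sigma <- summary(fit)$sigma        # residual standard deviation

# Prediction for xp plus normal noise reproduces the spread around the line
log_syn <- as.vector(cbind(1, xp) %*% beta) + rnorm(nrow(xp), sd = sigma)
y_syn   <- exp(log_syn)            # back to the original scale
```

Because the noise is added before the back-transformation, the synthetic values stay strictly positive, which is the usual reason for choosing a log transform for skewed variables.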
A list with two components:
res |
a vector of length |
fit |
a data frame with regression coefficients and error estimates. |
Generates univariate synthetic data for binary or binomial response variable using logistic regression model.
syn.logreg(y, x, xp, denom = NULL, denomp = NULL, proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
denom |
an original denominator vector of length |
denomp |
a synthesised denominator vector of length |
proper |
a logical value specifying whether proper synthesis should be conducted. See details. |
... |
additional parameters. |
Synthesis for binary response variables by the non-Bayesian or approximate Bayesian logistic regression model. The non-Bayesian method consists of the following steps:
1. Fit a logistic regression to the original data.
2. Calculate predicted inverse logits for synthesised covariates.
3. Compare the inverse logits to a random (0,1) deviate and get synthetic values.
The Bayesian version (for proper synthesis) includes an additional step before computing inverse logits, namely drawing coefficients from a normal distribution with mean and variance estimated in step 1.
The method relies on the standard glm.fit function.
Warnings from glm.fit are suppressed. Perfect prediction
is handled by the data augmentation method.
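The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code of syn.logreg(): the helper name sketch_logreg is hypothetical, it uses glm() rather than glm.fit() for brevity, and it omits the data augmentation for perfect prediction.

```r
# Hypothetical helper illustrating the steps; not synthpop's syn.logreg().
sketch_logreg <- function(y, x, xp, proper = FALSE) {
  fit  <- glm(y ~ x, family = binomial)   # step 1: fit to the original data
  beta <- coef(fit)
  if (proper) {
    # Extra step for proper synthesis: draw coefficients from N(beta, V),
    # using the Cholesky factor of the estimated covariance matrix V.
    V    <- vcov(fit)
    beta <- beta + as.vector(t(chol(V)) %*% rnorm(length(beta)))
  }
  p <- plogis(as.vector(cbind(1, xp) %*% beta))  # step 2: inverse logits
  as.integer(runif(nrow(xp)) <= p)               # step 3: compare with U(0,1)
}

set.seed(2)
n  <- 300
x  <- cbind(z = rnorm(n))
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x[, "z"]))
xp <- cbind(z = rnorm(n))

y_syn        <- sketch_logreg(y, x, xp)
y_syn_proper <- sketch_logreg(y, x, xp, proper = TRUE)
```

The only difference between the two calls is whether the coefficients used for prediction are the point estimates or a random draw around them.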
A list with two components:
res |
a vector of length |
fit |
a summary of the model fitted to the observed data and used to produce synthetic values. |
Synthesises one variable (y) from another one (x)
when y is nested in the categories of x. A bootstrap
sample is created from the original values of y within each category
of xp (the synthesised values of the grouping variable).
syn.nested(y, x, xp, smoothing = "", cont.na = NA, ...)
y |
an original data vector of length |
x |
an original data vector of length |
xp |
a vector of length |
smoothing |
smoothing method. See |
cont.na |
when y is numeric this can be a list or a vector giving values
of |
... |
additional parameters. |
An example would be when x is a classification of occupations and y is a
more detailed sub-classification. It is intended that x is a categorical
(factor) variable. A warning will be issued if the original y is not nested
within x.
A variable synthesised by syn.nested() is automatically excluded from
predicting later variables because it will provide no extra information,
given its grouping variable.
syn.nested() is also used for the final synthesis of variables in syn()
when the option numtocat is used to synthesise numerical variables as groups.
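A small base R sketch may help to picture the nested bootstrap described above: for every synthesised group value in xp, one original y is drawn from the records that share that group in x. The toy data and object names are assumptions for illustration, and smoothing and missing-value handling are left out.

```r
# Sketch only: toy data where y (detailed code) is nested in x (broad group).
set.seed(3)
x  <- factor(c("A", "A", "A", "B", "B", "C"))
y  <- c("a1", "a2", "a1", "b1", "b1", "c1")
xp <- factor(c("B", "A", "C", "A"), levels = levels(x))  # synthesised groups

# Nesting check: every value of y should belong to exactly one category of x
is_nested <- all(tapply(as.character(x), y, function(v) length(unique(v))) == 1)

# Bootstrap within category: sample one original y among records with x == g
y_syn <- vapply(as.character(xp), function(g) {
  pool <- y[x == g]
  pool[sample.int(length(pool), 1)]   # sample.int avoids sample()'s scalar quirk
}, character(1), USE.NAMES = FALSE)
```

Because each synthetic y is fully determined, up to resampling, by its group, it carries no additional information for predicting later variables.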
A list with two components:
res |
a vector of length |
fit |
a name of the method used for synthesis ( |
Generates univariate synthetic data using linear regression analysis.
syn.norm(y, x, xp, proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
a logical value specifying whether proper synthesis should be conducted. See details. |
... |
additional parameters. |
Generates synthetic values using the spread around the
fitted linear regression line of y given x.
For proper synthesis first the regression coefficients
are drawn from a normal distribution with mean and variance
from the fitted model.
A list with two components:
res |
a vector of length |
fit |
a data frame with regression coefficients and error estimates. |
syn, syn.normrank, syn.lognorm
Generates univariate synthetic data using linear regression analysis and preserves the marginal distribution. Regression is carried out on Normal deviates of ranks in the original variable. Synthetic values are assigned from the original values based on the synthesised ranks that are transformed from their synthesised Normal deviates.
syn.normrank(y, x, xp, smoothing = "", proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
smoothing |
smoothing method. See |
proper |
a logical value specifying whether proper synthesis should be conducted. See details. |
... |
additional parameters. |
First generates synthetic values of Normal deviates of ranks of
the values in y using the spread around the fitted
linear regression line of Normal deviates of ranks given x.
Then synthetic Normal deviates of ranks are transformed back to
get synthetic ranks which are used to assign values from y.
For proper synthesis first the regression coefficients
are drawn from a normal distribution with mean and variance
from the fitted model.
A smoothing method can be applied by setting the smoothing parameter (see
syn.smooth). It is recommended as a tool to decrease the
disclosure risk.
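The rank-based procedure described above can be sketched in base R as below. This is an illustration under assumptions, not the package code: the scaling of ranks by n + 1, the rounding of synthetic ranks and all object names are choices made for the example, and proper synthesis and smoothing are omitted.

```r
# Sketch only: simulated data, hypothetical object names.
set.seed(4)
n  <- 250
x  <- cbind(u = rnorm(n))
y  <- exp(x[, "u"] + rnorm(n))          # skewed original variable
xp <- cbind(u = rnorm(n))               # synthesised covariates

# Normal deviates of the ranks of y (ranks scaled into the open interval (0, 1))
z   <- qnorm(rank(y) / (n + 1))
fit <- lm(z ~ x)

# Synthetic deviates: prediction for xp plus noise with the residual sd
z_syn <- as.vector(cbind(1, xp) %*% coef(fit)) +
  rnorm(nrow(xp), sd = summary(fit)$sigma)

# Transform deviates back to ranks and pick the original value with that rank
r_syn <- round(pnorm(z_syn) * (n + 1))
r_syn <- pmin(pmax(r_syn, 1), n)        # keep ranks inside 1..n
y_syn <- sort(y)[r_syn]
```

Every synthetic value is one of the original values, which is why the marginal distribution is preserved and also why smoothing is suggested as a way to lower disclosure risk.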
A list with two components:
res |
a vector of length |
fit |
a data frame with regression coefficients and error estimates. |
syn, syn.norm, syn.lognorm, syn.smooth
Derives a new variable according to a specified function of synthesised data.
syn.passive(data, func)
data |
a data frame with synthesised data. |
func |
a |
Any function of the synthesised data can be specified. Note that several
operators such as +, -, * and ^ have different meanings in formula syntax.
Use the identity function I() if they should be interpreted as arithmetic
operators, e.g. "~I(age^2)".
Function syn() checks whether the passive assignment is correct in the original data
and fails with a warning if this is not true. The variables synthesised passively can be
used to predict later variables in the synthesis except when they are numeric variables
with missing data. A warning is produced in this last case.
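To see why I() matters, the short base R sketch below evaluates two formula strings against a small data frame with model.frame(). Using model.frame() here is an assumption chosen to demonstrate formula semantics; it is not a statement about how syn.passive() evaluates func internally. The data frame syn_data is made up for the example.

```r
# Sketch only: toy "synthesised" data.
syn_data <- data.frame(height = c(170, 180), weight = c(65, 90))

# With I(), the operators are ordinary arithmetic: BMI from weight and height
bmi <- as.numeric(
  model.frame(as.formula("~I(weight / height^2 * 10000)"), data = syn_data)[[1]]
)

# Without I(), ^ is a formula operator, so height^2 reduces to the term height
# and the column comes back unsquared
no_I <- as.numeric(
  model.frame(as.formula("~height^2"), data = syn_data)[[1]]
)
```

Here bmi holds weight / height^2 * 10000 for each row, whereas no_I is just the original height column.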
A list with two components:
res |
a vector of length |
fit |
a name of the method used for synthesis ( |
Gillian Raab, 2021 based on Stef van Buuren, Karin Groothuis-Oudshoorn, 2000
Van Buuren, S. and Groothuis-Oudshoorn, K. (2011).
mice: Multivariate Imputation by Chained Equations in R.
Journal of Statistical Software, 45(3), 1-67. doi:10.18637/jss.v045.i03
### the example shows how inconsistencies in the SD2011 data are picked up
### by syn.passive()
ods <- SD2011[, c("height", "weight", "bmi", "age", "agegr")]
ods$hsq <- ods$height^2
ods$sex <- SD2011$sex
meth <- c("cart", "cart", "~I(weight / height^2 * 10000)",
          "cart", "~I(cut(age, c(15, 24, 34, 44, 59, 64, 120)))",
          "~I(height^2)", "logreg")

## Not run:
### fails for bmi
s1 <- syn(ods, method = meth, seed = 6756, models = TRUE)

### fails for agegr
ods$bmi <- ods$weight / ods$height^2 * 10000
s2 <- syn(ods, method = meth, seed = 6756, models = TRUE)

### fails because of wrong order
ods$agegr <- cut(ods$age, c(15, 24, 34, 44, 59, 64, 120))
s3 <- syn(ods, method = meth, visit.sequence = 7:1,
          seed = 6756, models = TRUE)
## End(Not run)

### runs without errors
ods$bmi <- ods$weight / ods$height^2 * 10000
ods$agegr <- cut(ods$age, c(15, 24, 34, 44, 59, 64, 120))
s4 <- syn(ods, method = meth, seed = 6756, models = TRUE)

### bmi and hsq do not predict sex because of missing values
s4$models$sex

### hsq with no missing values used to predict sex
ods2 <- ods[!is.na(ods$height),]
s5 <- syn(ods2, method = meth, seed = 6756, models = TRUE)
s5$models$sex

### agegr with missing values used to predict sex because not numeric
ods3 <- ods
ods3$age[1:4] <- NA
ods3$agegr <- cut(ods3$age, c(15, 24, 34, 44, 59, 64, 120))
s6 <- syn(ods3, method = meth, seed = 6756, models = TRUE)
s6$models$sex
Generates univariate synthetic data using predictive mean matching.
syn.pmm(y, x, xp, smoothing = "", proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
a logical value specifying whether proper synthesis should be conducted. See details. |
smoothing |
smoothing method. See documentation for
|
... |
additional parameters. |
Synthesis of y by predictive mean matching. The procedure
is as follows:
1. Fit a linear regression to the original data.
2. Compute predicted values y.hat and ysyn.hat for the original x and
synthesised xp covariates respectively.
3. For each predicted value ysyn.hat find donor observations with the
closest predicted values y.hat (ties are broken by random selection),
randomly sample one of them and take its observed value y as the
synthetic value.
The Bayesian version (for proper synthesis) includes an additional step before computing predicted values:
draw coefficients from a normal distribution with mean and variance estimated in step 1 and use them to calculate predicted values for the synthesised covariates.
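The matching procedure above can be illustrated with a short base R sketch. It is not the code of syn.pmm(): the number of donors k is an assumption (the text does not state it), ties are resolved by order() rather than at random, and only the non-proper case is shown.

```r
# Sketch only: simulated data, hypothetical object names.
set.seed(5)
n  <- 200
x  <- cbind(u = rnorm(n))
y  <- 2 + 3 * x[, "u"] + rnorm(n)
xp <- cbind(u = rnorm(150))                 # synthesised covariates

fit      <- lm(y ~ x)                       # fit to the original data
y_hat    <- fitted(fit)                     # predictions for original x
ysyn_hat <- as.vector(cbind(1, xp) %*% coef(fit))  # predictions for xp

k <- 3                                      # number of donors: an assumption
y_syn <- vapply(ysyn_hat, function(p) {
  donors <- order(abs(y_hat - p))[seq_len(k)]  # k closest predicted values
  y[donors[sample.int(k, 1)]]                  # observed y of a random donor
}, numeric(1))
```

As with other donor-based methods, every synthetic value is an observed value of y, so the result cannot fall outside the range of the original data.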
A list with two components:
res |
a vector of length |
fit |
a data frame with regression coefficients and error estimates. |
Generates a synthetic categorical variable using ordered polytomous regression (without or with bootstrap).
syn.polr(y, x, xp, proper = FALSE, maxit = 1000, trace = FALSE, MaxNWts = 10000, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
for proper synthesis ( |
maxit |
the maximum number of iterations for |
trace |
switch for tracing optimization for |
MaxNWts |
the maximum allowable number of weights for |
... |
Generates synthetic ordered categorical variables by the proportional odds logistic regression (polr) model. The function repeatedly applies logistic regression on the successive splits. The model is also known as the cumulative link model.
The algorithm of syn.polr uses the function polr from the MASS package.
In order to avoid bias due to perfect prediction, the data are augmented by the method of White, Daniel and Royston (2010).
In case the call to polr fails, usually because the data are very sparse,
the multinom function is used instead.
A list with two components:
res |
a vector of length |
fit |
a summary of the model fitted to the observed data and used to produce synthetic values. |
White, I.R., Daniel, R. and Royston, P. (2010). Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables. Computational Statistics and Data Analysis, 54, 2267–2275.
syn, syn.polyreg, multinom, polr
Generates a synthetic categorical variable using unordered polytomous regression (without or with bootstrap).
syn.polyreg(y, x, xp, proper = FALSE, maxit = 1000, trace = FALSE, MaxNWts = 10000, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
for proper synthesis ( |
maxit |
the maximum number of iterations for |
trace |
switch for tracing optimization for |
MaxNWts |
the maximum allowable number of weights for |
... |
additional parameters passed to |
Generates synthetic categorical variables by the polytomous regression model. The method consists of the following steps:
1. Fit categorical response as a multinomial model.
2. Compute predicted categories.
3. Add appropriate noise to predictions.
The algorithm of syn.polyreg uses the function multinom from the nnet
package. Any numerical variables are scaled to cover the range (0,1) before
fitting. Warnings are printed if the algorithm fails to converge in maxit
iterations and also if the synthesised data has only one category. The latter
may occur if the variable being synthesised is sparse so that the algorithm
fails to iterate.
In order to avoid bias due to perfect prediction, the data are augmented by the method of White, Daniel and Royston (2010).
NOTE that when the function is called by setting elements of method in syn()
to "polyreg", the parameters maxit, trace and MaxNWts can be supplied to
syn() as e.g. polyreg.maxit.
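One common way to turn predicted category probabilities into noisy categorical draws, as in the "predict, then add noise" steps described above, is to compare a uniform deviate with the cumulative probabilities of each row. The base R sketch below shows that idea only; the helper draw_categories is hypothetical, the probability matrix is made up rather than produced by multinom, and it is an assumption that this matches what syn.polyreg() does internally.

```r
# Hypothetical helper: draw one category per row of a probability matrix P.
draw_categories <- function(P, levels) {
  u   <- runif(nrow(P))                 # one uniform deviate per record
  cum <- t(apply(P, 1, cumsum))         # row-wise cumulative probabilities
  idx <- 1 + rowSums(u > cum)           # first category whose cum. prob >= u
  idx <- pmin(idx, ncol(P))             # guard against rounding in the sums
  factor(levels[idx], levels = levels)
}

set.seed(9)
lev <- c("low", "mid", "high")
P   <- rbind(c(1.0, 0.0, 0.0),          # certain "low"
             c(0.0, 0.0, 1.0),          # certain "high"
             c(0.2, 0.5, 0.3))          # genuinely random
y_syn <- draw_categories(P, lev)
```

Rows with a degenerate distribution always return their certain category, while the third row varies from draw to draw according to its probabilities.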
A list with two components:
res |
a vector of length |
fit |
a summary of the model fitted to the observed data and used to produce synthetic values. |
White, I.R., Daniel, R. and Royston, P. (2010). Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables. Computational Statistics and Data Analysis, 54, 2267–2275.
Generates univariate synthetic data using a fast implementation of
random forests. It uses the ranger function from the ranger package.
syn.ranger(y, x, xp, smoothing = "", proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
smoothing |
smoothing method for numeric variable. See
|
proper |
for proper synthesis ( |
... |
additional parameters passed to
|
...
A list with two components:
res |
a vector of length |
fit |
the model fitted to the observed data that was used to produce synthetic values. |
...
syn, syn.rf, syn.bag, syn.cart, ranger, syn.smooth
Generates univariate synthetic data using Breiman's random forest algorithm
for classification and regression. It uses the randomForest function
from the randomForest package.
syn.rf(y, x, xp, smoothing = "", proper = FALSE, ntree = 10, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
smoothing |
smoothing method for numeric variable. See
|
proper |
for proper synthesis ( |
ntree |
number of trees to grow. |
... |
additional parameters passed to
|
...
A list with two components:
res |
a vector of length |
fit |
the fitted model which is an object of class |
...
syn, syn.rf, syn.bag, syn.cart, randomForest, syn.smooth
Generates a random sample from the observed data.
syn.sample(y, xp, smoothing = "", cont.na = NA, proper = FALSE, ...)
y |
an original data vector of length |
xp |
a target length |
smoothing |
smoothing method for numeric variable. See documentation
for |
cont.na |
a vector of codes for missing values for continuous variables that should be excluded from smoothing. |
proper |
if |
... |
additional parameters passed to |
A simple random sample with replacement is taken from the
observed values in y and used as synthetic values.
Gaussian kernel smoothing can be applied to continuous variables
by setting the smoothing parameter to "density". It is recommended
as a tool to decrease the disclosure risk.
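The following base R sketch illustrates the two ideas above: resampling with replacement, and Gaussian kernel smoothing implemented by adding normal noise whose standard deviation equals a kernel bandwidth. It is a sketch under assumptions rather than the package code: the bandwidth is taken from the original values with bw.SJ(), and rounding to the original number of decimal places is not reproduced.

```r
# Sketch only: simulated data, hypothetical object names.
set.seed(6)
y     <- rnorm(100, mean = 50, sd = 10)   # observed continuous variable
n_syn <- 80                               # target number of synthetic values

# Simple random sample with replacement from the observed values
y_syn <- sample(y, n_syn, replace = TRUE)

# Gaussian kernel smoothing: jitter each sampled value with N(0, bw^2) noise,
# which amounts to sampling from a Gaussian kernel density estimate
bw       <- bw.SJ(y)                      # Sheather-Jones bandwidth (assumed choice)
y_smooth <- rnorm(n_syn, mean = y_syn, sd = bw)
```

Without smoothing every synthetic value is an exact copy of an observed value; after smoothing the values are close to, but no longer identical with, the originals.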
A list with two components:
res |
a vector of length |
fit |
a name of the method used for synthesis ( |
Synthesises one variable (y) from all possible
combinations of its predictors (x). A bootstrap sample is created
from the original values of y within each unique combination
of xp (the synthesised values of the grouping variables).
syn.satcat(y, x, xp, proper = FALSE, ...)
y |
an original data vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
if |
... |
additional parameters. |
It is intended that the variables in x are categorical (factor)
variables. If y is also a categorical variable syn.satcat will
give the same results as fitting a saturated polychotomous regression model but
will usually be much faster. syn.satcat will fail with an error message
if previous syntheses have generated a combination of variables in xp
that was not present in x. Use of the syn.catall method for
grouped variables can overcome this.
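The behaviour described above, including the failure when xp contains an unseen combination, can be pictured with this base R sketch. It is illustrative only: the key-building approach with paste() and all object names are assumptions for the example, not the internals of syn.satcat().

```r
# Sketch only: toy data with two categorical predictors.
set.seed(7)
x  <- data.frame(sex = c("F", "F", "M", "M", "M"),
                 reg = c("N", "S", "N", "N", "S"))
y  <- c("a", "b", "c", "d", "e")
xp <- data.frame(sex = c("M", "F", "M"),
                 reg = c("N", "S", "S"))    # synthesised predictors

# One key per record identifying its combination of predictor values
key  <- do.call(paste, c(x,  sep = "|"))
keyp <- do.call(paste, c(xp, sep = "|"))

# A combination in xp that never occurs in x has no donors, so stop
unseen <- setdiff(keyp, key)
if (length(unseen) > 0) stop("combination(s) not in original data: ",
                             paste(unseen, collapse = ", "))

# Bootstrap y within each combination
y_syn <- vapply(keyp, function(k) {
  pool <- y[key == k]
  pool[sample.int(length(pool), 1)]
}, character(1), USE.NAMES = FALSE)
```

With many predictors or many small categories the number of combinations grows quickly, which makes unseen combinations in xp, and hence failure, more likely.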
A list with two components:
res |
a data frame of dimension |
fit |
the cross-tabulation of the original predictor variables. |
ods <- SD2011[, c("region", "sex", "agegr", "placesize")]
s1 <- syn(ods, method = c("sample", "cart", "satcat", "cart"))

## Not run:
### mostly fails because too many small categories
s2 <- syn(ods, method = c("sample", "cart", "cart", "satcat"))

## End(Not run)
Implements three different smoothing methods for numeric data.
syn.smooth(ysyn, yobs = NULL, smoothing = "spline", window = 5, ...)
ysyn |
non-missing synthetic data to be smoothed. |
yobs |
original data used by all methods to determine number of
decimal places and by method |
smoothing |
a character vector that can take values |
window |
width of window for running mean. |
... |
additional parameters. |
Smooths numeric variables by three methods. The default is "spline", which
uses a smoothing spline. The others are "density", which uses a Gaussian
kernel density estimator with bandwidth selected using the Sheather-Jones
'solve-the-equation' method (see bw.SJ), and "rmean", which smooths with a
running mean of width window (see runningmean).
A vector of smoothed values of ysyn
.
syn, syn.sample, syn.normrank, syn.pmm, syn.ctree, syn.cart,
syn.bag, syn.rf, syn.ranger, syn.nested
Generates synthetic event indicator and time to event data using classification and regression trees (without or with bootstrap).
syn.survctree(y, yevent, x, xp, proper = FALSE, minbucket = 5, ...)
y |
a vector of length |
yevent |
a vector of length |
x |
a matrix ( |
xp |
a matrix ( |
proper |
for proper synthesis ( |
minbucket |
the minimum number of observations in
any terminal node. See |
... |
additional parameters passed to |
The procedure for synthesis by a CART model is as follows:
1. Fit a tree-structured survival model by binary recursive partitioning (the terminal nodes include Kaplan-Meier estimates of the survival time).
2. For each xp find the terminal node.
3. Randomly draw a donor from the members of the node and take the observed
value of yevent and y from that draw as the synthetic values.
The function is used in syn() to generate survival times
by setting elements of method in syn() to "survctree".
Additional parameters related to the ctree function,
e.g. minbucket, can be supplied to syn() as survctree.minbucket.
Where the survival variable is censored this information must be supplied
to syn() as a named list (event) that gives the name of the variable
for each event indicator. Event variables can be a numeric variable with
values 1/0 (1 = event), TRUE/FALSE (TRUE = event) or a factor with 2 levels
(level 2 = event). The event variable(s) will be synthesised along with the
survival time(s).
A list with the following components:
syn.time |
a vector of length |
syn.event |
a vector of length |
fit |
the fitted model which is an item of class |
### This example uses the data set 'mgus2' from the survival package.
### It has a follow-up time variable 'futime' and an event indicator 'death'.
library(survival)

### first exclude the 'id' variable and run a dummy synthesis to get
### a method vector
ods <- mgus2[-1]
s0 <- syn(ods)

### create new method vector including 'survctree' for 'futime' and create
### an event list for it; the names of the list element must correspond to
### the name of the follow-up variable for which the event indicator
### need to be specified.
meth <- s0$method
meth[names(meth) == "futime"] <- "survctree"
evlist <- list(futime = "death")

s1 <- syn(ods, method = meth, event = evlist)

### evaluate outputs
## compare selected variables
compare(s1, ods, vars = c("futime", "death", "sex", "creat"))

## compare original and synthetic follow up time by an event indicator
multi.compare(s1, ods, var = "futime", by = "death")

## compare survival curves for original and synthetic data
par(mfrow = c(2,1))
plot(survfit(Surv(futime, death) ~ sex, data = ods),
     col = 1:2, xlim = c(0,450), main = "Original data")
legend("topright", levels(ods$sex), col = 1:2, lwd = 1, bty = "n")
plot(survfit(Surv(futime, death) ~ sex, data = s1$syn),
     col = 1:2, xlim = c(0,450), main = "Synthetic data")
Distributional comparison of synthesised data set with the original (observed) data set using propensity scores.
This function can also be used with synthetic data NOT created by
syn(), but then additional parameters not.synthesised and cont.na
might need to be provided.
## S3 method for class 'synds'
utility.gen(object, data, method = "cart", maxorder = 1,
  k.syn = FALSE, tree.method = "rpart", max.params = 400,
  print.stats = c("pMSE", "S_pMSE"), resamp.method = NULL,
  nperms = 50, cp = 1e-3, minbucket = 5, mincriterion = 0,
  vars = NULL, aggregate = FALSE, maxit = 200, ngroups = NULL,
  print.flag = TRUE, print.every = 10, digits = 6,
  print.zscores = FALSE, zthresh = 1.6, print.ind.results = FALSE,
  print.variable.importance = FALSE, ...)

## S3 method for class 'data.frame'
utility.gen(object, data, not.synthesised = NULL, cont.na = NULL,
  method = "cart", maxorder = 1, k.syn = FALSE, tree.method = "rpart",
  max.params = 400, print.stats = c("pMSE", "S_pMSE"),
  resamp.method = NULL, nperms = 50, cp = 1e-3, minbucket = 5,
  mincriterion = 0, vars = NULL, aggregate = FALSE, maxit = 200,
  ngroups = NULL, print.flag = TRUE, print.every = 10, digits = 6,
  print.zscores = FALSE, zthresh = 1.6, print.ind.results = FALSE,
  print.variable.importance = FALSE, ...)

## S3 method for class 'list'
utility.gen(object, data, not.synthesised = NULL, cont.na = NULL,
  method = "cart", maxorder = 1, k.syn = FALSE, tree.method = "rpart",
  max.params = 400, print.stats = c("pMSE", "S_pMSE"),
  resamp.method = NULL, nperms = 50, cp = 1e-3, minbucket = 5,
  mincriterion = 0, vars = NULL, aggregate = FALSE, maxit = 200,
  ngroups = NULL, print.flag = TRUE, print.every = 10, digits = 6,
  print.zscores = FALSE, zthresh = 1.6, print.ind.results = FALSE,
  print.variable.importance = FALSE, ...)

## S3 method for class 'utility.gen'
print(x, digits = NULL, zthresh = NULL, print.zscores = NULL,
  print.stats = NULL, print.ind.results = NULL,
  print.variable.importance = NULL, ...)
object |
it can be an object of class |
data |
the original (observed) data set. |
not.synthesised |
a vector of variable names for any variables that have
been left unchanged in the synthetic data. Not required if object is of
class |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
method |
a single string specifying the method for modeling the propensity
scores. Method can be selected from |
maxorder |
maximum order of interactions to be considered in
|
k.syn |
a logical indicator as to whether the sample size itself has been synthesised. |
tree.method |
implementation of |
max.params |
the maximum number of parameters for a |
print.stats |
statistics to be printed must be a selection from
|
resamp.method |
method used for resampling estimates of standardized
measures can be |
nperms |
number of permutations for the permutation test to obtain the
null distribution of the utility measure when |
cp |
complexity parameter for classification with tree.method
|
minbucket |
minimum number of observations allowed in a leaf for
classification when |
mincriterion |
criterion between 0 and 1 to use to control
|
vars |
variables to be included in the utility comparison. It can be a character vector of names of variables or an integer vector of their column indices. If none are specified all the variables in the synthesised data will be included. |
aggregate |
logical flag as to whether the data should be aggregated by
collapsing identical rows before computation. This can lead to much faster
computation when all the variables are categorical. Only works for
|
maxit |
maximum iterations to use when |
ngroups |
target number of groups for categorisation of each numeric
variable: final number may differ if there are many repeated values. If
|
print.flag |
TRUE/FALSE to indicate if any messages should be printed during calculations. Change to FALSE for simulations. |
print.every |
controls the printing of progress of resampling when
|
... |
|
x |
an object of class |
digits |
number of digits to print in the default output values. |
zthresh |
threshold value to use to suppress the printing of z-scores
under |
print.zscores |
logical value as to whether z-scores for coefficients of the logit model should be printed. |
print.ind.results |
logical value as to whether utility score results from individual syntheses should be printed. |
print.variable.importance |
logical value as to whether the variable
importance measure should be printed when |
This function follows the method for evaluating the utility of masked data as given in Snoke et al. (2018) and originally proposed by Woo et al. (2009). The original and synthetic data are combined into one dataset and propensity scores, as detailed in Rosenbaum and Rubin (1983), are calculated to estimate the probability of membership in the synthetic data set. The utility measure is based on the mean squared difference between these probabilities and the probability expected if the data did not distinguish the synthetic data from the original.
If k.syn = FALSE the expected probability is just the proportion of
synthetic data in the combined data set, 0.5 when the original and
synthetic data have the same number of records. Setting k.syn = TRUE
indicates that the number of observations in the synthetic data was
synthesised and not fixed by the synthesiser. In this case the expected
probability will be 0.5 in all cases and the model to discriminate
between observed and synthetic will include an intercept term. This will
usually only apply when the standalone version of this function
utility.gen.sa() is used.
Propensity scores can be modeled by logistic regression (method = "logit") or by two different implementations of classification and regression trees (method = "cart"). For logistic regression the predictors are all variables in the data and their interactions up to order maxorder. The default of 1 gives all main effects and first order interactions. For logistic regression the null distribution of the propensity score is derived and is used to calculate ratios and standardised values.
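The logistic-regression calculation can be illustrated without the package. The following base R sketch uses simulated data (the variables x1 and x2 and all object names are illustrative and not part of synthpop), stacks the original and synthetic records, fits a main-effects logit model and compares the fitted propensity scores with the expected probability. The null expectation used for the ratio, (k - 1)(1 - c)^2 c / N for a model with k coefficients, is the form we understand Snoke et al. (2018) to give; utility.gen() itself may differ in detail.

```r
set.seed(1)
n_obs <- 500
n_syn <- 500

# Simulated stand-ins for the observed and synthetic data (illustrative only)
obs     <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs))
syn_dat <- data.frame(x1 = rnorm(n_syn), x2 = rnorm(n_syn))

# Stack the two data sets and add an indicator: 1 = synthetic, 0 = original
comb   <- rbind(obs, syn_dat)
comb$t <- rep(c(0, 1), c(n_obs, n_syn))

# Propensity scores from a main-effects logistic regression
fit    <- glm(t ~ x1 + x2, family = binomial, data = comb)
pscore <- fitted(fit)

N    <- nrow(comb)
cc   <- n_syn / N                 # expected probability (0.5 here)
pMSE <- mean((pscore - cc)^2)     # propensity score mean squared error

# Assumed null expectation for a logit model with k coefficients
# (including the intercept), so that df = k - 1
k        <- length(coef(fit))
exp_pMSE <- (k - 1) * (1 - cc)^2 * cc / N
S_pMSE   <- pMSE / exp_pMSE       # ratio to the null expectation

c(pMSE = pMSE, S_pMSE = S_pMSE)
```

Because the two simulated data sets come from the same distribution, the ratio would be expected to be of the order of 1; values well above 1 would suggest that the model can tell synthetic from original records.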
For method = "cart" the expectation and variance of the null distribution are calculated from a permutation test. Our recent work indicates that this method can sometimes give misleading results.
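The idea of a permutation null can be sketched as follows. This is not the code used by utility.gen(): the simulated data, the helper tree_pmse() and the rpart control settings are assumptions chosen for illustration. The indicator is shuffled so that it carries no information, the tree is refitted each time, and the observed pMSE is compared with the mean and standard deviation of the permuted values.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(2)
n <- 300
# Simulated stand-ins; the synthetic x2 is deliberately too variable
obs     <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
syn_dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n, sd = 1.3))
comb    <- rbind(obs, syn_dat)
labels  <- factor(rep(c("0", "1"), each = n))   # "1" = synthetic
cc      <- 0.5                                   # expected probability

# Hypothetical helper: pMSE from a classification tree for a given labelling
tree_pmse <- function(lab, dat, cc) {
  dat$t <- lab
  fit <- rpart(t ~ ., data = dat, method = "class",
               control = rpart.control(cp = 0, minbucket = 20))
  p <- predict(fit, type = "prob")[, "1"]        # propensity of being synthetic
  mean((p - cc)^2)
}

obs_pmse <- tree_pmse(labels, comb, cc)

# Permutation null: refit the tree on shuffled labels
nperms    <- 20
null_pmse <- replicate(nperms, tree_pmse(sample(labels), comb, cc))

S_pMSE <- obs_pmse / mean(null_pmse)                     # ratio to null mean
z_pMSE <- (obs_pmse - mean(null_pmse)) / sd(null_pmse)   # standardised value
```

With few permutations, or when many permuted trees make no splits at all, the estimated null mean and variance can be unstable, which is one way such results can mislead.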
If missing values exist, indicator variables are added and included in the model as recommended by Rosenbaum and Rubin (1984). For categorical variables, NA is treated as a new category.
An object of class utility.gen, which is a list including the utility measures and their expected null values for each synthetic set, with the following components:
call |
the call that produced the result. |
m |
number of synthetic data sets in object. |
method |
method used to fit propensity score. |
tree.method |
cart function used to fit propensity score when
|
resamp.method |
type of resampling used to get |
maxorder |
see above. |
vars |
see above. |
nfix |
see above. |
aggregate |
see above. |
maxit |
see above. |
ngroups |
see above. |
df |
degrees of freedom for the chi-squared test for logit models
derived from the number of non-aliased coefficients in the logistic model,
minus |
mincriterion |
see above. |
nperms |
see above. |
incomplete |
TRUE/FALSE indicator if any of the variables being compared are not synthesised. |
pMSE |
propensity score mean square error from the utility model or a
vector of these values if |
S_pMSE |
ratio(s) of |
PO50 |
percentage over 50% of each synthetic data set where the model used correctly predicts whether real or synthetic. |
S_PO50 |
ratio(s) of |
SPECKS |
Kolmogorov-Smirnov statistic to compare the propensity scores for the original and synthetic records. |
S_SPECKS |
ratio(s) of |
print.stats |
see above. |
fit |
the fitted model for the propensity score or a list of fitted
models of length |
nosplits |
for resampling methods and cart models, a list of the number of times from the total each resampled cart model failed to select any splits to classify the indicator. Indicates that this method is not working correctly and results should not be used but a logit model selected instead. |
digits |
see above. |
print.ind.results |
see above. |
print.zscores |
see above. |
zthresh |
see above. |
print.variable.importance |
see above. |
Woo, M-J., Reiter, J.P., Oganian, A. and Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality, 1(1), 111-124.
Rosenbaum, P.R. and Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.
Snoke, J., Raab, G.M., Nowok, B., Dibben, C. and Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A, 181, Part 3, 663-688.
## Not run:
ods <- SD2011[1:1000, c("age", "bmi", "depress", "alcabuse", "nofriend")]
s1 <- syn(ods, m = 5, method = "parametric", cont.na = list(nofriend = -8))

### synthetic data provided as a 'synds' object
u1 <- utility.gen(s1, ods)
print(u1, print.zscores = TRUE, zthresh = 1, digits = 6)
u2 <- utility.gen(s1, ods, ngroups = 3, print.flag = FALSE)
print(u2, print.zscores = TRUE)
u3 <- utility.gen(s1, ods, method = "cart", nperms = 20)
print(u3, print.variable.importance = TRUE)

### synthetic data provided as 'list'
utility.gen(s1$syn, ods, cont.na = list(nofriend = -8))
## End(Not run)
Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.
It can also be used with synthetic data NOT created by syn(), but then an additional parameter cont.na might need to be provided.
## S3 method for class 'synds'
utility.tab(object, data, vars = NULL, ngroups = 5,
  useNA = TRUE, max.table = 1e6,
  print.tables = length(vars) < 4,
  print.stats = c("pMSE", "S_pMSE", "df"),
  print.zdiff = FALSE, print.flag = TRUE,
  digits = 4, k.syn = FALSE, ...)

## S3 method for class 'data.frame'
utility.tab(object, data, vars = NULL, cont.na = NULL,
  ngroups = 5, useNA = TRUE, max.table = 1e6,
  print.tables = length(vars) < 4,
  print.stats = c("pMSE", "S_pMSE", "df"),
  print.zdiff = FALSE, print.flag = TRUE,
  digits = 4, k.syn = FALSE, ...)

## S3 method for class 'list'
utility.tab(object, data, vars = NULL, cont.na = NULL,
  ngroups = 5, useNA = TRUE, max.table = 1e6,
  print.tables = length(vars) < 4,
  print.stats = c("pMSE", "S_pMSE", "df"),
  print.zdiff = FALSE, print.flag = TRUE,
  digits = 4, k.syn = FALSE, ...)

## S3 method for class 'utility.tab'
print(x, print.tables = NULL, print.zdiff = NULL,
  print.stats = NULL, digits = NULL, ...)
object |
an object of class |
data |
the original (observed) data set. |
vars |
a single string or a vector of strings with the names of variables to be used to form the table. |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
max.table |
a maximum table size. You could try increasing the default value, but memory problems are likely. |
ngroups |
if numerical (non-factor) variables are included they will be
classified into this number of groups to form tables. Classification is
performed using |
useNA |
determines if NA values are to be included in tables. |
print.tables |
a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions. |
print.stats |
a single string or a vector of strings that determines
which utility measures to print. Must be a selection from:
|
print.zdiff |
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed. |
print.flag |
a logical value that determines if messages are to be printed during computation. |
digits |
an integer indicating the number of decimal places for printing
statistics, |
k.syn |
a logical indicator as to whether the sample size itself has
been synthesised. The default value is |
... |
additional parameters; can be passed to classIntervals() function. |
x |
an object of class |
Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021). If the synthesising model is correct the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.
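As an illustration of the standardisation, the base R sketch below computes a Voas-Williamson type statistic and its standardised version for a single one-way table. The simulated variable and the object names are illustrative. The form of VW used here, the sum over cells of (s - o)^2 / ((s + o) / 2) for synthetic counts s and observed counts o, is an assumption that applies to equal sample sizes; Raab et al. (2021) give the definitions actually used. The degrees of freedom are taken as the number of cells with any counts minus one.

```r
set.seed(3)
n   <- 1000
lev <- c("a", "b", "c", "d")
pr  <- c(0.4, 0.3, 0.2, 0.1)

# Illustrative categorical variable, original and synthetic, equal sizes
obs_x <- factor(sample(lev, n, replace = TRUE, prob = pr), levels = lev)
syn_x <- factor(sample(lev, n, replace = TRUE, prob = pr), levels = lev)

o <- as.vector(table(obs_x))   # observed cell counts
s <- as.vector(table(syn_x))   # synthetic cell counts

# Keep only cells with any observed or synthesised counts
keep <- (o + s) > 0
o <- o[keep]
s <- s[keep]

VW   <- sum((s - o)^2 / ((s + o) / 2))  # assumed form, equal sample sizes
df   <- length(o) - 1                   # non-empty cells minus one
S_VW <- VW / df                         # expected value of about 1 under a correct model

c(VW = VW, df = df, S_VW = S_VW)
```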
Four other measures are calculated by considering the table as a prediction model: the propensity score mean-squared error pMSE; from a comparison of propensity scores for the synthetic and original data, the Kolmogorov-Smirnov statistic SPECKS and the Wilcoxon rank-sum statistic U; and the percentage of the observations correctly predicted in the combined tables over 50% (PO50), where the majority of observations in each grouping are in agreement with the category (real or synthetic) of the observation. The first of these, pMSE, is identical except for a constant to VW. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().
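Treating the table as a prediction model can be made concrete with a few lines of base R. The cell counts below are invented for illustration. The propensity score of every record in a cell is taken to be the share of synthetic records in that cell; the form of VW is the one assumed for equal sample sizes, and the treatment of PO50 (majority vote per cell, with no special handling of ties) is our reading of the description above rather than the package code.

```r
# Illustrative cell counts: original (o) and synthetic (s), 300 records each
o <- c(120, 80, 60, 40)
s <- c(110, 95, 50, 45)

N  <- sum(o) + sum(s)   # records in the combined data
cc <- sum(s) / N        # expected probability, 0.5 here

# Propensity score per cell and the resulting pMSE
p_cell <- s / (s + o)
pMSE   <- sum((s + o) * (p_cell - cc)^2) / N

# The same quantity obtained from VW (assumed form, equal sample sizes)
VW <- sum((s - o)^2 / ((s + o) / 2))
isTRUE(all.equal(pMSE, VW / (8 * N)))

# PO50: percentage correctly predicted by the majority category of each
# cell, expressed as the excess over 50%
PO50 <- 100 * sum(pmax(s, o)) / N - 50
```

For equal sample sizes the constant linking the two measures under these definitions is 8N, where N is the number of records in the combined data.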
Three further measures are calculated from the tables. The first is the mean absolute difference in distributions, MabsDD, the average absolute difference in the proportions of original and synthetic data from all the cells in the table. The second is a weighted version of this measure, WMabsDD, where the weights are proportional to the inverse of the variance of the absolute differences so that this measure can be standardised by its expected value, df. Finally, the Bhattacharyya distance BhattD is derived from the overlap of the histograms of the original and synthetic data sets.
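A small base R sketch of the unweighted measure, using invented cell counts, is given below. It also computes the Bhattacharyya coefficient, the usual measure of overlap between two histograms; how the package converts this overlap into the reported distance is not reproduced here, and the weights used for WMabsDD are not shown because they are not stated in this help page.

```r
# Illustrative cell counts: original (o) and synthetic (s)
o <- c(120, 80, 60, 40)
s <- c(110, 95, 50, 45)

p_obs <- o / sum(o)   # cell proportions in the original data
p_syn <- s / sum(s)   # cell proportions in the synthetic data

# Mean absolute difference in distributions over all cells
MabsDD <- mean(abs(p_syn - p_obs))

# Bhattacharyya coefficient: 1 for identical histograms, 0 for no overlap
overlap <- sum(sqrt(p_obs * p_syn))
```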
An object of class utility.tab, which is a list with the following components:
m |
number of synthetic data sets in object, i.e. |
VW |
a vector with |
FT |
a vector with |
JSD |
a vector with |
SPECKS |
a vector with |
WMabsDD |
a vector with |
U |
a vector with |
G |
a vector with |
pMSE |
a vector with |
PO50 |
a vector with |
MabsDD |
a vector with |
dBhatt |
a vector with |
S_VW |
|
S_FT |
|
S_JSD |
|
S_WMabsDD |
WMabsDD/df. |
S_G |
|
S_pMSE |
standardised measure from |
df |
a vector of degrees of freedom for the chi-square tests, equal
to the number of cells in the tables with any observed or
synthesised counts minus one when |
dfG |
degrees of freedom used in standardising |
nempty |
a vector of length |
tab.obs |
a table from the observed data. |
tab.syn |
a table or a list of |
tab.zdiff |
a table or a list of |
digits |
an integer indicating the number of decimal places
for printing statistics, |
print.tables |
a logical value that determines if tables of observed and synthesised data are to be printed. |
print.stats |
a single string or a vector of strings with utility measures to be printed out. |
print.zdiff |
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed. |
n |
number of observations in the original dataset. |
k.syn |
a logical indicator as to whether the sample size itself has been synthesised. |
Nowok, B., Raab, G.M and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.
Read, T.R.C. and Cressie, N.A.C. (1988) Goodness–of–Fit Statistics for Discrete Multivariate Data, Springer–Verlag, New York.
Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]
s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")

s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)

### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
  print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)
Calculates and plots tables of utility measures. The calculations of utility measures are done by the function utility.tab. Options are all one-way tables, all two-way tables or three-way tables for a specified third variable along with pairs of all other variables.
This function can also be used with synthetic data NOT created by syn(), but then the additional parameters not.synthesised and cont.na might need to be provided.
## S3 method for class 'synds'
utility.tables(object, data, tables = "twoway", maxtables = 5e4,
  vars = NULL, third.var = NULL, useNA = TRUE, ngroups = 5,
  tab.stats = c("pMSE", "S_pMSE", "df"), plot.stat = "S_pMSE",
  plot = TRUE, print.tabs = FALSE, digits.tabs = 4,
  max.scale = NULL, min.scale = 0, plot.title = NULL,
  nworst = 5, ntabstoprint = 0, k.syn = FALSE,
  low = "grey92", high = "#E41A1C",
  n.breaks = NULL, breaks = NULL, ...)

## S3 method for class 'data.frame'
utility.tables(object, data, cont.na = NULL, not.synthesised = NULL,
  tables = "twoway", maxtables = 5e4,
  vars = NULL, third.var = NULL, useNA = TRUE, ngroups = 5,
  tab.stats = c("pMSE", "S_pMSE", "df"), plot.stat = "S_pMSE",
  plot = TRUE, print.tabs = FALSE, digits.tabs = 4,
  max.scale = NULL, min.scale = 0, plot.title = NULL,
  nworst = 5, ntabstoprint = 0, k.syn = FALSE,
  low = "grey92", high = "#E41A1C",
  n.breaks = NULL, breaks = NULL, ...)

## S3 method for class 'list'
utility.tables(object, data, cont.na = NULL, not.synthesised = NULL,
  tables = "twoway", maxtables = 5e4,
  vars = NULL, third.var = NULL, useNA = TRUE, ngroups = 5,
  tab.stats = c("pMSE", "S_pMSE", "df"), plot.stat = "S_pMSE",
  plot = TRUE, print.tabs = FALSE, digits.tabs = 4,
  max.scale = NULL, min.scale = 0, plot.title = NULL,
  nworst = 5, ntabstoprint = 0, k.syn = FALSE,
  low = "grey92", high = "#E41A1C",
  n.breaks = NULL, breaks = NULL, ...)

## S3 method for class 'utility.tables'
print(x, print.tabs = NULL, digits.tabs = NULL, plot = NULL,
  plot.title = NULL, max.scale = NULL, min.scale = NULL,
  nworst = NULL, ntabstoprint = NULL, ...)
object |
an object of class |
data |
the original (observed) data set. |
cont.na |
a named list of codes for missing values for continuous
variables if different from the |
not.synthesised |
a vector of variable names for any variables that have been left unchanged in the synthetic data. |
tables |
defines the type of tables to produce. Options are
|
maxtables |
maximum number of tables that will be produced. If number of
tables is larger, then utility is only measured for a sample of size
|
.
vars |
a vector of strings with the names of variables to be used to form the table, or a vector of variable numbers in the original data. Defaults to all variables in both original and synthetic data. |
third.var |
when |
useNA |
determines if |
ngroups |
if numerical (non-factor) variables included with
|
tab.stats |
statistics to include in the table of results. Must be
a selection from: |
plot.stat |
statistics to plot. Choice is |
plot |
determines if plot will be produced when the result is printed. |
print.tabs |
logical value that determines if table of results is to be printed. |
digits.tabs |
number of digits to print for table, except for p-values that are always printed to 4 places. |
max.scale |
a numeric value for the maximum value used in calculating
the shading of the plots. If it is |
min.scale |
a numeric value for the minimum value used in calculating
the shading of the plots. If it is |
plot.title |
title for the plot. |
nworst |
a number of variable combinations with worst utility scores to be printed. |
ntabstoprint |
a number of tables to print for observed and synthetic data with the worst utility. |
k.syn |
a logical indicator as to whether the sample size itself has been synthesised. |
low |
colour for low end of the gradient. |
high |
colour for high end of the gradient. |
n.breaks |
a number of break points to create if breaks are not given directly. |
breaks |
breaks for a two colour binned gradient. |
... |
additional parameters |
x |
an object of class |
Calculates tables of observed and synthesised values for the variables specified in vars with the function utility.tab and produces tables and plots of one-way, two-way or three-way utility measures formed from vars. Several options for utility measures can be selected for printing or plotting. Details are in the help file for utility.tab.
The tables and variables with the worst utility scores are identified. Visualisations of the matrices of utility scores are plotted. For threeway tables a third variable can be defined to select all tables involving that variable for plotting. If it is not specified the variable with tables giving the worst utility is selected as the third variable.
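The ranking of tables can be sketched in base R as follows. The simulated data, the helper functions make_data() and pair_score(), and the use of a standardised Voas-Williamson type statistic (in the form assumed for equal sample sizes) are all illustrative choices and not the package implementation. Every pair of variables is cross-tabulated in the original and synthetic data, a score is computed for each pair, and the scores are then sorted and averaged per variable.

```r
set.seed(4)
n <- 500

# Hypothetical helper producing an illustrative categorical data set
make_data <- function(n) {
  data.frame(
    sex    = factor(sample(c("F", "M"), n, replace = TRUE),
                    levels = c("F", "M")),
    edu    = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                    levels = c("low", "mid", "high")),
    region = factor(sample(c("N", "S", "E", "W"), n, replace = TRUE),
                    levels = c("N", "S", "E", "W"))
  )
}
obs     <- make_data(n)
syn_dat <- make_data(n)

# Hypothetical helper: standardised VW for the table formed by variables v
pair_score <- function(v, obs, syn_dat) {
  o <- as.vector(table(obs[, v]))
  s <- as.vector(table(syn_dat[, v]))
  keep <- (o + s) > 0                       # cells with any counts
  VW <- sum((s[keep] - o[keep])^2 / ((s[keep] + o[keep]) / 2))
  VW / (sum(keep) - 1)
}

# All two-way combinations of variables and their scores
var_pairs <- combn(names(obs), 2, simplify = FALSE)
scores <- vapply(var_pairs, pair_score, numeric(1),
                 obs = obs, syn_dat = syn_dat)
names(scores) <- vapply(var_pairs, paste, character(1), collapse = ":")

# Worst (largest) scores first
sort(scores, decreasing = TRUE)

# Average score per variable over all pairs that involve it
var_scores <- sapply(names(obs), function(v) {
  mean(scores[vapply(var_pairs, function(p) v %in% p, logical(1))])
})
```

The per-variable averages give a quick indication of which variables contribute most to poor utility.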
An object of class utility.tables, which is a list with the following components:
tabs |
a table with all the selected measures for all combinations of
variables defined by |
plot.stat |
measure used in |
tables |
see above. |
third.var |
see above. |
utility.plot |
plot of the selected utility measure. |
var.scores |
an average of utility scores for all combinations with other variables. |
plot |
see above. |
print.tabs |
see above. |
digits.tabs |
see above. |
plot.title |
see above. |
max.scale |
see above. |
min.scale |
see above. |
ntabstoprint |
see above. |
nworst |
see above. |
worstn |
variable combinations with |
worsttabs |
observed and synthetic cross-tabulations for |
Read, T.R.C. and Cressie, N.A.C. (1988) Goodness–of–Fit Statistics for Discrete Multivariate Data, Springer–Verlag, New York.
Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital", "region", "income")]
s1 <- syn(ods)

### synthetic data provided as a 'synds' object
(t1 <- utility.tables(s1, ods, tab.stats = "all", print.tabs = TRUE))

### synthetic data provided as a 'data.frame' object
(t1 <- utility.tables(s1$syn, ods, tab.stats = "all", print.tabs = TRUE))

t2 <- utility.tables(s1, ods, tables = "twoway")
print(t2, max.scale = 3)

(t3 <- utility.tables(s1, ods, tab.stats = "all", tables = "threeway",
  third.var = "sex", print.tabs = TRUE))
(t4 <- utility.tables(s1, ods, tab.stats = "all", tables = "threeway",
  third.var = "sex", useNA = FALSE, print.tabs = TRUE))
(t5 <- utility.tables(s1, ods, tab.stats = "all", print.tabs = TRUE))
Exports synthetic data set(s) from a synthesised data set (synds) object to external files of a selected format. Currently supported file formats include: SPSS, Stata, SAS, csv, tab, rda, RData and txt. For SPSS, Stata and SAS it uses functions from the foreign package with some adjustments where necessary. Information about the synthesis is written into a separate text file.
NOTE: Currently numeric codes and labels can be preserved correctly only for SPSS files imported into R using the read.obs function.
write.syn(object, filename,
  filetype = c("SPSS", "Stata", "SAS", "csv", "tab",
    "rda", "RData", "txt"),
  convert.factors = "numeric", data.labels = NULL,
  save.complete = TRUE, extended.info = TRUE, ...)
object |
an object of class |
filename |
the name of the file (excluding extension) which the
synthetic data are to be written into. For multiple synthetic data sets
it will be used as a prefix followed respectively by |
filetype |
a desired format of the output files. |
convert.factors |
a single string indicating how to handle factors in
Stata output files. The default value is set to |
data.labels |
a list with variable labels and value labels. |
save.complete |
a logical value indicating whether a complete
'synthesised data set' ( |
extended.info |
a logical value indicating whether extended information should be saved into an information file. |
... |
additional parameters passed to write functions. |
File(s) with synthesised data set(s) and a text file with information about synthesis are produced. Optionally a complete synthesised data set object is saved into a synobject_filename.RData file.