Reporting on the event dedicated to synthetic data at CBS
2023/06/13
Clémentine Cottineau
Symposium Synthetische Data
Synthetic data simulate the joint characteristics of the relationships between people and objects (e.g. a school or a neighbourhood), thus allowing the simulation of reality without identifying the person or object. Synthetic data can be generated by an algorithm or a computer simulation. The advantage of synthetic data is that, depending on the purpose of the user, a trade-off is made between the analytical value of the dataset and the disclosure risk.
— cbs.nl
On 1st June 2023, CBS organised a symposium on synthetic data (in Dutch) in The Hague to bring together producers and potential users of synthetic data. The objectives were to:
- inform the audience about where the Netherlands in general, and CBS in particular, stand on synthetic data (at the legal, ethical and technical levels).
- present the opportunities of synthetic data for research and policy, based on ongoing examples developed in collaboration with universities and public bodies from the health and education sectors.
- facilitate exchanges between users and providers of synthetic data by inviting synthetic data software developers to pitch their solutions in the form of posters.
- discuss the societal implications of synthetic data (trust in public data, impact of a legal prohibition of deepfakes, etc.).
What I found most interesting is that CBS is currently testing solutions and actively working on producing a synthetic version of its microdata. Indeed, CBS data are precious for their quality and breadth, but working on sensitive individual data in a secure remote environment is inconvenient in practice. With synthetic data, data analysis could become much more practical and open for social researchers, who could potentially share the raw synthetic data (not as good as the raw original data, but better than aggregates).
Additionally, Barteld Braaksma (innovation manager at CBS) announced that CBS is considering a wide range of uses for its synthetic data, from test datasets to training sets for machine learning to inputs for agent-based modelling. The latter is the use we would make of CBS synthetic data in SEGUE, provided that the population generation is spatially explicit, i.e. respects spatial distributions as well as cross-distributions of socioeconomic attributes. Of course, such a multivariate and spatially explicit population is the closest to a replica of the original population, and therefore the one with the highest disclosure risk. We look forward to learning more about this initiative to see whether we could use the results in our project.
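To make "spatially explicit" a bit more concrete, here is a minimal sketch of one common way to build such a population: iterative proportional fitting (IPF). This is not the method CBS described at the symposium; it is only an illustration. All names, categories and numbers below (`NATIONAL`, `NEIGHBOURHOODS`, the income and education groups) are hypothetical. The idea is to take a national cross-table of two attributes as a seed and rescale it for each neighbourhood until it matches that neighbourhood's own margins.

```python
import random

INCOME = ["low", "high"]
EDUCATION = ["basic", "advanced"]

# Hypothetical national cross-table (rows: income, columns: education).
NATIONAL = [[40.0, 10.0], [15.0, 35.0]]

# Hypothetical margins per neighbourhood (counts per income / education group).
NEIGHBOURHOODS = {
    "A": {"income": [70, 30], "education": [60, 40]},
    "B": {"income": [20, 80], "education": [30, 70]},
}


def ipf(seed, row_targets, col_targets, iterations=50):
    """Rescale a 2-D seed table until its margins match the targets."""
    table = [row[:] for row in seed]  # work on a copy, leave the seed intact
    for _ in range(iterations):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
    return table


def synthesise(rng):
    """Draw synthetic individuals, neighbourhood by neighbourhood."""
    cells = [(inc, edu) for inc in INCOME for edu in EDUCATION]
    people = []
    for name, margins in NEIGHBOURHOODS.items():
        fitted = ipf(NATIONAL, margins["income"], margins["education"])
        weights = [fitted[i][j] for i in range(len(INCOME))
                   for j in range(len(EDUCATION))]
        n_people = sum(margins["income"])
        for inc, edu in rng.choices(cells, weights=weights, k=n_people):
            people.append({"neighbourhood": name, "income": inc, "education": edu})
    return people


population = synthesise(random.Random(42))
```

The fitted tables reproduce each neighbourhood's margins while keeping the association between income and education found in the national table. Because individuals are then drawn at random from the fitted table, the sampled population matches the margins only in expectation. The more attributes and the finer the spatial units, the closer such a population gets to the original one, which is exactly where the disclosure risk mentioned above comes from.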
In SEGUE, we aim to build an agent-based simulation of urban economic segregation in the Netherlands. To run this agent-based model, we need:
- an initial population (at t = 0);
- a set of mechanisms to determine how things change over time (how agents decide to change residence, to form a household, to pick a school, etc.).
Selecting and combining the set of mechanisms generating urban economic segregation over time is the core of our project, whereas the initial population is a technical requirement that determines the resemblance to the target system (i.e. the population of the Netherlands). A synthetic version of the CBS microdata produced by CBS would be a great choice if available.
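The two ingredients above can be sketched in a few lines of Python. This is a toy illustration, not the SEGUE model: the random initial population, the single relocation rule (`relocate`) and every parameter value are assumptions made for the example. The initial population is where a synthetic version of the CBS microdata would plug in.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    income: str         # "low" or "high"
    neighbourhood: int  # index of the neighbourhood of residence


def initial_population(rng, n_agents=200, n_neighbourhoods=4):
    """Hypothetical population at t = 0, with attributes drawn at random."""
    return [Agent(rng.choice(["low", "high"]), rng.randrange(n_neighbourhoods))
            for _ in range(n_agents)]


def share_similar(agent, agents, neighbourhood):
    """Share of a neighbourhood's other residents in the agent's income group."""
    residents = [a for a in agents
                 if a.neighbourhood == neighbourhood and a is not agent]
    if not residents:
        return 0.0
    return sum(a.income == agent.income for a in residents) / len(residents)


def relocate(agent, agents, rng, n_neighbourhoods):
    """Illustrative mechanism: move if a random candidate has more similar residents."""
    candidate = rng.randrange(n_neighbourhoods)
    current = share_similar(agent, agents, agent.neighbourhood)
    if share_similar(agent, agents, candidate) > current:
        agent.neighbourhood = candidate


def dissimilarity(agents, n_neighbourhoods):
    """Dissimilarity index between income groups (0 = even, 1 = fully separated)."""
    low_total = sum(a.income == "low" for a in agents)
    high_total = len(agents) - low_total
    if low_total == 0 or high_total == 0:
        return 0.0
    total = 0.0
    for k in range(n_neighbourhoods):
        low = sum(a.income == "low" and a.neighbourhood == k for a in agents)
        high = sum(a.income == "high" and a.neighbourhood == k for a in agents)
        total += abs(low / low_total - high / high_total)
    return total / 2


def run(steps=20, seed=1, n_neighbourhoods=4):
    """Apply the mechanism to every agent at each time step and track segregation."""
    rng = random.Random(seed)
    agents = initial_population(rng, n_neighbourhoods=n_neighbourhoods)
    history = [dissimilarity(agents, n_neighbourhoods)]
    for _ in range(steps):
        for agent in agents:
            relocate(agent, agents, rng, n_neighbourhoods)
        history.append(dissimilarity(agents, n_neighbourhoods))
    return agents, history


agents, history = run()
```

Here `history` records one segregation measure per time step, starting from the initial population. Swapping `initial_population` for a realistic synthetic population changes how closely the starting point resembles the target system, while swapping or combining rules like `relocate` changes the dynamics, which is the part our project focuses on.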