Modern studies of societal phenomena rely on the availability of large datasets capturing attributes and
activities of synthetic, city-level populations. For instance, in epidemiology, synthetic population datasets are
necessary to study disease propagation and intervention measures before implementation. In social science,
synthetic population datasets are needed to understand how policy decisions might affect preferences and
behaviors of individuals. In public health, synthetic population datasets are necessary to capture diagnostic
and procedural characteristics of patient records without violating the confidentiality of individuals. To generate
such datasets over a large set of categorical variables, we propose the use of the maximum entropy principle
to formalize a generative model such that in a statistically well-founded way we can optimally utilize given
prior information about the data, and are unbiased otherwise. An efficient inference algorithm is designed
to estimate the maximum entropy model, and we demonstrate how our approach is adept at estimating
underlying data distributions. We evaluate this approach against both simulated data and US census datasets,
and demonstrate its feasibility using an epidemic simulation application.
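To make the maximum entropy idea concrete, the following is a minimal sketch, not the inference algorithm from the paper. It uses classic iterative proportional fitting: starting from the uniform distribution over all combinations of categorical values, it rescales the joint distribution until it matches the given marginals, which yields the maximum entropy distribution consistent with that prior information. The function names (`fit_maxent`, `sample_population`), the three toy variables, and all probabilities are hypothetical and chosen only for illustration.

```python
import itertools
import random


def fit_maxent(domains, constraints, iters=500, tol=1e-12):
    """Fit a joint distribution by iterative proportional fitting (IPF).

    domains: list of category lists, one per variable.
    constraints: list of (variable_indices, target) pairs, where target maps
        a tuple of labels for those variables to its required probability.
    Starting from uniform, IPF converges to the maximum entropy distribution
    that satisfies the constraints, provided they are mutually consistent.
    """
    cells = list(itertools.product(*domains))
    p = {cell: 1.0 / len(cells) for cell in cells}
    for _ in range(iters):
        worst = 0.0
        for variables, target in constraints:
            # Current marginal of the model over the constrained variables.
            current = {}
            for cell, mass in p.items():
                key = tuple(cell[v] for v in variables)
                current[key] = current.get(key, 0.0) + mass
            # Rescale every cell so this marginal matches its target.
            for cell in cells:
                key = tuple(cell[v] for v in variables)
                if current[key] > 0.0:
                    p[cell] *= target.get(key, 0.0) / current[key]
            # Track the largest mismatch observed before rescaling.
            worst = max(worst, max(abs(current[k] - target.get(k, 0.0))
                                   for k in current))
        if worst < tol:
            break
    return p


def sample_population(p, size, seed=0):
    """Draw synthetic individuals from the fitted joint distribution."""
    rng = random.Random(seed)
    cells = list(p)
    return rng.choices(cells, weights=[p[c] for c in cells], k=size)


# Hypothetical toy variables: age group, employment status, region.
domains = [["young", "old"], ["employed", "unemployed"], ["north", "south"]]
# Illustrative prior information: a joint marginal over (age, employment)
# and a single-variable marginal over region. The numbers are made up.
constraints = [
    ((0, 1), {("young", "employed"): 0.35, ("young", "unemployed"): 0.15,
              ("old", "employed"): 0.30, ("old", "unemployed"): 0.20}),
    ((2,), {("north",): 0.6, ("south",): 0.4}),
]
model = fit_maxent(domains, constraints)
population = sample_population(model, size=1000)
```

In this toy setup nothing links region to the other two variables, so the fitted model treats region as independent of age and employment. That is the "unbiased otherwise" behavior: the model encodes the supplied marginals and assumes nothing beyond them.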
Preferred Citation
Hao Wu, Yue Ning, Prithwish Chakraborty, Jilles Vreeken, Nikolaj Tatti and Naren Ramakrishnan. Generating Realistic Synthetic Population Datasets. In: ACM Transactions on Knowledge Discovery and Data Mining. 2018.
Primary Research Area
Trustworthy Information Processing
Legacy Posted Date
2019-06-07
Journal
ACM Transactions on Knowledge Discovery and Data Mining
Pages
1 - 45
Open Access Type
Unknown
Sub Type
Article
BibTeX
@article{cispa_all_2912,
title = "Generating Realistic Synthetic Population Datasets",
author = "Wu, Hao and Ning, Yue and Chakraborty, Prithwish and Vreeken, Jilles and Tatti, Nikolaj and Ramakrishnan, Naren",
journal = "{ACM Transactions on Knowledge Discovery and Data Mining}",
year = "2018",
}