Abstract:
Integrating large, heterogeneous socioeconomic and clinical databases is now considered essential for capturing and modeling the longitudinal and social aspects of diseases. However, such integration poses significant challenges. Databases are often stored in different locations, use different identifiers, vary in data quality, record information in bespoke, purpose-specific formats, and carry different levels of associated metadata. Novel computational methods are required to integrate such databases and enable their statistical analysis for clinical research. In this paper, we describe a probabilistic approach for constructing a very large population-based cohort comprising 114 million individuals by linking clinical databases from the National Health System with other administrative databases from various government entities, in order to facilitate epidemiological research. We discuss and evaluate the design and validation of our data integration model and probabilistic data linkage methods for creating research data marts that can be statistically analyzed.


