Session topic: What makes great data documentation?
Documentation is the tool that describes how and why a database was created, what its strengths and limitations are, and how all of the various components fit together. As such, it is an invaluable resource for helping others understand what they can do with the data. Please join us for a discussion of what makes great data documentation.
This session will begin with a 40-minute integrated presentation by the Manitoba Centre for Health Policy (MCHP) and the Institute for Clinical Evaluative Sciences (ICES), two of the leading population data research centres in Canada, covering the following topics:
1 – Structured Overviews
2 – Data Models
3 – Data Dictionaries
4 – Other Documentation and Published Reports
5 – Integrating Blog or Analyst Notes
6 – Data Quality Reporting
VIMO tables
Heat maps
Trend analysis
Relevancy
Session Facilitators: Mahmoud Azimaee, Institute for Clinical Evaluative Sciences (ICES)
Mark Smith, Manitoba Centre for Health Policy (MCHP)
The Intended Outcome: A research paper based on the discussion for publication in the International Journal of Population Data Science (IJPDS). All participants are invited to join us as co-authors in drafting and revising the paper.
The objective of this project is to implement a harmonized artificial intelligence (AI)-based de-identification of free-text medical data across multiple Canadian jurisdictions. This federated learning approach will allow these jurisdictions to leverage each other's data and resources while no individual-level data leaves the jurisdiction.
Federated Learning enables health data centers in different jurisdictions to collaborate in training machine learning models without sharing individual-level data. This approach will significantly reduce privacy and cybersecurity risks and barriers that are involved in sharing and moving data across different jurisdictions.
In a federated learning environment, machine learning models are trained on multiple data sources available in local data centers; local data are not shared to a central computing/analysis environment. Instead, parameters (such as model weights) are shared between these local data centers to generate a global model that will be shared and used by all participating data centers.
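The weight-sharing step described above can be sketched as follows. This is a minimal illustration, assuming the global model is built by federated averaging weighted by each site's number of training examples; the abstract does not name the aggregation rule, and the function name, site names, and numbers below are hypothetical.

```python
from typing import Dict, List

# One model = named parameter vectors. Only these leave a data center,
# never the individual-level records they were trained on.
Weights = Dict[str, List[float]]


def federated_average(local_weights: List[Weights], n_examples: List[int]) -> Weights:
    """Average per-site weights, weighting each site by its training-set size."""
    total = sum(n_examples)
    global_weights: Weights = {}
    for name in local_weights[0]:
        length = len(local_weights[0][name])
        global_weights[name] = [
            sum(w[name][i] * n for w, n in zip(local_weights, n_examples)) / total
            for i in range(length)
        ]
    return global_weights


# Hypothetical weights from two sites after local training.
site_a = {"layer1": [0.2, 0.4], "bias": [0.0]}
site_b = {"layer1": [0.6, 0.8], "bias": [1.0]}
global_model = federated_average([site_a, site_b], n_examples=[100, 300])
```

In practice the averaged model would be sent back to every site for another round of local training; the sketch shows a single aggregation round only.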
In this case study, four health research data centers in different Canadian provinces will take part in deployment of an AI-based application for de-identification of free-text data. The data centers are members of Health Data Research Network (HDRN) Canada. The deployment will include:
harmonized annotation and labeling of local data,
local training of entity recognition algorithms,
integrating model weights from each data center to create a global model,
development of license agreements between the participating data centers to allow sharing of model weights.
This is an ongoing project. The talk will demonstrate learning experiences, advantages, and challenges in a federated learning environment and explore the feasibility of transporting this approach to other multi-jurisdiction research networks.
Introduction: Due to the ever-growing volume and complexity of clinical data, it has become a tedious task to extract information from data for secondary uses such as decision support, quality assurance, and outcome analysis. Recently, there have been great advances in Natural Language Processing (NLP) approaches that automate knowledge extraction from clinical reports in order to save costs and improve efficiency.
Objectives/Approach: Our goal is the development of an NLP tool designed to automatically extract and encode clinical information from laboratory reports. This study describes and evaluates our NLP tool on a provincial repository of laboratory tests and results called the Ontario Laboratory Information System (OLIS). OLIS is an electronic system that covers >200 labs and stores patients' current and past test results as patients move through different areas of the healthcare system. Our NLP tool is a modular system of pipelined components, including a Named Entity Recognition module that extracts virus and test mentions and an inference module that combines the extracted entities into a meaningful outcome.
Results: Initial analyses were conducted on a segment of OLIS related to laboratory tests for respiratory viruses. This data included over a million observations corresponding to ~100 Logical Observation Identifiers Names and Codes (LOINC), with >40,000 unique strings. The clinical text was cleaned, tokenized, and parsed using an in-house text algorithm that was continually refined with manual review from clinical experts. This data was then encoded as virus and test types to be used as a ground truth. The NLP tool was built on ground truth data and achieved an accuracy greater than 95%.
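The pipeline described above (cleaning, entity extraction for virus and test mentions, then inference) can be illustrated with a small rule-based sketch. It is not the in-house algorithm: the lexicons, the result-polarity rule, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical lexicons; the real tool's vocabularies are not given in the abstract.
VIRUS_TERMS = {
    "influenza a": "Influenza A",
    "flu a": "Influenza A",
    "respiratory syncytial virus": "RSV",
    "rsv": "RSV",
}
TEST_TERMS = {"pcr": "PCR", "culture": "Culture", "antigen": "Antigen"}


def clean(text):
    """Lowercase the report and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).strip()


def extract(text, lexicon):
    """Return canonical labels for every lexicon term found as a whole phrase."""
    found = []
    for term, label in lexicon.items():
        if re.search(rf"\b{re.escape(term)}\b", text) and label not in found:
            found.append(label)
    return found


def infer(report):
    """Combine extracted entities into one structured outcome (assumed schema)."""
    text = clean(report)
    negative = bool(re.search(r"\b(not detected|negative)\b", text))
    positive = not negative and bool(re.search(r"\b(detected|positive)\b", text))
    result = "negative" if negative else "positive" if positive else "unknown"
    return {
        "viruses": extract(text, VIRUS_TERMS),
        "tests": extract(text, TEST_TERMS),
        "result": result,
    }


first = infer("Influenza A virus RNA NOT DETECTED by PCR.")
second = infer("RSV antigen: POSITIVE")
```

A learned Named Entity Recognition model would replace the dictionary lookup in `extract`; the surrounding clean-extract-infer structure stays the same.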
Conclusion/Implications: Approaches like these can be applied to many areas of health research that make use of clinical reports. Our methods, when optimized and validated, can be deployed into clinical systems to provide on-the-spot analysis of various laboratory reports.
Overall objectives or goal: Most organizations that use population administrative data for research purposes have an internal repository of validated definitions and algorithms of their own. Many of these concepts and definitions are applicable, or at least adaptable, to other organizations and jurisdictions. A comprehensive National (and potentially International) Concept Dictionary could help investigators carry out methodologically sound work with consistent and validated algorithms, drawing on a shared pool of knowledge and resources.
The Institute for Clinical Evaluative Sciences (ICES) in Ontario, Canada has recently modernized its internal Concept Dictionary by adopting standard templates based on the Manitoba Centre for Health Policy (MCHP) Concept Dictionary, reviewing and updating existing content and tagging the concept entries with appropriate MeSH terms and data sources, and adding standard computer code (e.g., SAS coding) where appropriate. A SharePoint® web-based application has been developed to provide advanced tagging, searching and browsing features.
We envision a wiki-based Concept Dictionary hosted on a cloud-based environment with very granular access controls to provide enough flexibility for each participating organization to control their own content. This means each organization will be able to decide on how to share their own concepts (or part of them) with the public or internal users.
All content will be tagged with MeSH terms as well as with the name of the organization that initially posts each entry. Other organizations that find the same concept applicable to their own use can tag the same entry with their organization name, or refer to a secondary adapted entry if adaptation to fit their data and methodologies is required.
The Search feature will allow refining the search criteria by MeSH terms, data sources, and also organization/jurisdiction name.
Multiple layers of access controls will allow each organization to have their own groups of users with different standard privileges such as Local Administrators, Authors and Approvers (or Publishers).
The Approver (Publisher) users within each organization can publish each entry for internal or public view. This way, for example, a definition/algorithm can be viewable only within the organization until the validation process is complete, and then the entry can be made publicly available, while some sections, such as computer code, can remain restricted to the organization.
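One way to picture the section-level publishing described above is the sketch below. It assumes each section of an entry carries its own visibility flag and that only members of the owning organization see internal sections; the class, field, and organization names are hypothetical and do not describe the planned platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Visibility(Enum):
    INTERNAL = "internal"  # visible only inside the owning organization
    PUBLIC = "public"      # visible to everyone


@dataclass
class Entry:
    title: str
    owner: str
    # section name -> (text, visibility), set by an Approver (Publisher)
    sections: Dict[str, Tuple[str, Visibility]]


def visible_sections(entry: Entry, viewer_org: Optional[str] = None) -> Dict[str, str]:
    """Return the sections a viewer may read; viewer_org=None means the public."""
    is_internal_viewer = viewer_org == entry.owner
    return {
        name: text
        for name, (text, visibility) in entry.sections.items()
        if visibility is Visibility.PUBLIC or is_internal_viewer
    }


# Hypothetical entry: the definition is public, the computer code stays internal.
entry = Entry(
    title="Example concept",
    owner="Org A",
    sections={
        "definition": ("Validated definition text", Visibility.PUBLIC),
        "code": ("/* SAS code */", Visibility.INTERNAL),
    },
)
public_view = visible_sections(entry)
internal_view = visible_sections(entry, viewer_org="Org A")
```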
We will discuss challenges in developing and maintaining such a platform including the costs, governance, intellectual property rights, copyrights and liabilities for the participating organizations.
The intended output or outcome: We aim to use this opportunity to form a working group from the interested organizations that are ready to participate in and commit to developing this collaborative platform. After the conference, there will be follow-up sessions with the members of the working group to plan and develop the online application.
Background: Canadian health data repositories link datasets at the provincial level, based on their residents' registrations to provincial health insurance plans. Linking national datasets with provincial health care registries poses several challenges that may result in misclassification and impact the estimation of linkage rates. A recent linkage of a federal immigration database in the province of Manitoba illustrates these challenges.
Objectives: a) To describe the linkage of the federal Immigration, Refugees and Citizenship Canada Permanent Resident (IRCC-PR) database with the Manitoba healthcare registry and b) to compare data linkage methods and rates between four Canadian provinces, accounting for interprovincial mobility of immigrants.
Methods: We compared linkage rates by immigrants' province of intended destination (province vs. rest of Canada). We used external nationwide immigrant tax filing records to approximate actual settlement and obtain linkage rates corrected for interprovincial mobility.
Results: The immigrant linkage rates in Manitoba before and after accounting for interprovincial mobility were 84.8% and 96.1%, respectively. Linkage rates did not substantially differ according to immigrants' characteristics, with a few exceptions. Observed linkage rates across the four provinces ranged from 74.0% to 86.7%. After correction for interprovincial mobility, the estimated linkage rates increased >10 percentage points for the provinces that stratified by intended destination (British Columbia and Manitoba) and decreased up to 18 percentage points for provinces that could not use immigration records of those who did not intend to settle in the province (New Brunswick and Ontario).
Conclusions: Despite variations in methodology, provincial linkage rates were relatively high. The use of a national immigration dataset for linkage to provincial repositories allows a more comprehensive linkage than that of province-specific subsets. Observed linkage rates can be biased downwards by interprovincial migration, and methods that use external data sources can contribute to assessing potential selection bias and misclassification.
ABSTRACT
Objectives: Ontario, the most populous province in Canada, has a universal healthcare system that routinely collects health administrative data on its 13 million legal residents that is used for health research. Record linkage has become a vital tool for this research by enriching this data with the Immigration, Refugees and Citizenship Canada (IRCC) Permanent Resident database and the Office of the Registrar General's Vital Statistics-Death (VSD) registry. Our objectives were to estimate linkage rates and compare characteristics of individuals in the linked versus unlinked files.
Approach: We used both deterministic and probabilistic linkage methods to link the IRCC database (1985-2012) and VSD registry (1990-2012) to Ontario's Registered Persons Database. Linkage rates were estimated and standardized differences were used to assess differences in socio-demographic and other characteristics between the linked and unlinked records.
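The two-stage approach above can be sketched as a deterministic pass on exact keys followed by a probabilistic pass over the leftover records. The field names, agreement weights, and threshold below are hypothetical (a simplified Fellegi-Sunter-style score), not the parameters used in the study, and the records are made up.

```python
def deterministic_pass(source, registry, keys):
    """Link records whose key fields match exactly one registry record."""
    index = {}
    for rec in registry:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    links, unlinked = {}, []
    for rec in source:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:
            links[rec["id"]] = candidates[0]["id"]
        else:
            unlinked.append(rec)
    return links, unlinked


# Hypothetical (agreement, disagreement) weights per field.
WEIGHTS = {"surname": (4.0, -2.0), "dob": (5.0, -3.0), "sex": (1.0, -1.0), "postal": (3.0, -1.0)}


def score(a, b):
    """Sum agreement or disagreement weights across the compared fields."""
    return sum(agree if a[f] == b[f] else disagree for f, (agree, disagree) in WEIGHTS.items())


def probabilistic_pass(source, registry, threshold=6.0):
    """Link each leftover record to its best-scoring candidate above the threshold."""
    links = {}
    for rec in source:
        best = max(registry, key=lambda r: score(rec, r))
        if score(rec, best) >= threshold:
            links[rec["id"]] = best["id"]
    return links


registry = [
    {"id": "r1", "surname": "smith", "dob": "1980-01-02", "sex": "F", "postal": "M5V"},
    {"id": "r2", "surname": "nguyen", "dob": "1975-06-30", "sex": "M", "postal": "K1A"},
]
source = [
    {"id": "s1", "surname": "smith", "dob": "1980-01-02", "sex": "F", "postal": "M5V"},
    {"id": "s2", "surname": "nguyn", "dob": "1975-06-30", "sex": "M", "postal": "K1A"},
    {"id": "s3", "surname": "lee", "dob": "1990-12-12", "sex": "F", "postal": "L4C"},
]

links, leftover = deterministic_pass(source, registry, keys=("surname", "dob", "sex"))
links.update(probabilistic_pass(leftover, registry))
linkage_rate = len(links) / len(source)
```

Here `s1` links deterministically, `s2` links probabilistically despite the misspelled surname, and `s3` stays unlinked. A production linkage would also use blocking and prevent two source records from claiming the same registry record.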
Results: The overall linkage rates for the IRCC database and VSD registry were 86.4% and 96.2%, respectively. The majority (68.2%) of the record linkages in IRCC were achieved after the three deterministic passes, with the remaining 18.2% being linked probabilistically. Similarly, the majority (79.8%) of the record linkages in the VSD registry were linked using deterministic record linkage and the remaining 16.3% were linked after probabilistic and manual review. Unlinked and linked files were similar for most characteristics, such as age and marital status for IRCC and sex and most causes of death for VSD. However, lower linkage rates were observed among people born in East Asia (78%) in the IRCC database and certain causes of death in the VSD registry, namely perinatal conditions (61.3%) and congenital anomalies (81.3%).
Conclusion: The linkages of immigration and vital statistics data to existing population-based healthcare data in Ontario, Canada will enable many novel cross-sectional and longitudinal studies to be conducted. Analytic techniques to account for sub-optimal linkage rates may be required in studies of certain ethnic groups or certain causes of death among children and infants.
ICES was founded in 1992 to study the health care system and promote effective, efficient and equitable health care. Over 27 years later, the goal remains largely unchanged, though the institute has grown in size and impact. Created as an independent not-for-profit research institute and given what was, at the time, unprecedented access to administrative health data records for the population of Ontario, ICES' initial focus was to better understand the delivery of hospital services and translate its findings into better health care and policy. From modest beginnings with a handful of researchers located in a few hospital offices, ICES has grown to encompass a community of almost 500 scientists and staff across a network of seven physical sites in Ontario. The original focus on hospital-based services has expanded significantly and now includes research and analysis of community-based health services, health policy, Indigenous health, social determinants of health, and data science.
Introduction: A significant amount of valuable information in Electronic Health Records (EHR) such as laboratory test results or echocardiogram interpretations is embedded in lengthy free-text fields. Often patients' personal information is also included in these narratives. Privacy legislation in different jurisdictions requires de-identification of this information prior to making it available for research. This process can be challenging and time-consuming. In particular, rule-based algorithms may lead to over-masking of essential medical terms, conditions, or devices that are named after individuals.
Objectives and Approach: We aimed to enhance ICES' existing rule-based application to make it contextually driven by applying Artificial Intelligence (AI). The ICES team collaborated with computer scientists at the University of Manchester who had already published work in this area and Evenset, a Toronto-based software company. Based on the Manchester University de-identification framework for named entity recognition, three machine learning-based algorithms for named entity recognition were implemented: CRF, BiLSTM recurrent neural networks with GloVe word embeddings, and BiLSTM recurrent neural networks with ELMo word embeddings. The models were trained on three different types of ICES data: laboratory results, Electronic Medical Record (EMR) data, and echocardiogram data. Evenset developed the user interface and the masking modules.
Results: Preliminary tests have generated very promising results. To improve the accuracy of the models, additional data annotation to expand the training datasets is currently being undertaken at ICES. The final framework will be available to the public as an open-source tool.
Conclusion / Implications: A collaborative approach for solving complex problems like de-identification of text-based medical data is highly efficient, especially where there are unique sets of expertise, resources, data and clinical knowledge among stakeholders.
Introduction: Improving the care and management of patients with diabetes, particularly those with extreme blood glucose and/or cholesterol levels, has been identified as a key priority area for healthcare in Ontario. A multi-organizational collaboration produces audit-and-feedback reports distributed to consenting primary care physicians across the province for quality improvement purposes.
Objectives and Approach: We examined the feasibility of linking the Ontario Laboratory Information System (OLIS), a large and nearly population-wide database of laboratory test results in Ontario, with the existing provincial audit-and-feedback reporting structure to integrate aggregated, physician-level measures of glycemic and cholesterol control among patients with diabetes.
All Ontario residents alive on March 31, 2014, attached to a primary care physician, and diagnosed with diabetes for at least two years were included. These patients were linked to OLIS to extract laboratory test orders and results for glycated hemoglobin (HbA1C) and low-density lipoproteins (LDL) between April 1, 2013 and March 31, 2014.
Results: There were 1,108,530 diabetes patients included who were assigned to 10,085 primary care physicians. During fiscal year (FY) 2013, 70%, 64%, and 59% of diabetes patients were tested for HbA1C, LDL, and both measures, respectively. Among the 648,238 diabetes patients with at least one of each test in FY2013, 13% had an HbA1C test exceeding a threshold of 9%, 4% had an LDL test exceeding a threshold of 4 mmol/L, and 0.8% exceeded both thresholds. At the physician level, the median (interquartile range) proportions of diabetes patients exceeding the testing thresholds were 12% (9%-16%) for HbA1C and 4% (2%-6%) for LDL. In a multilevel logistic regression model, there was significant between-physician variability in the proportions of diabetes patients exceeding the HbA1C (p
Conclusion/Implications: We developed a mechanism for integrating population-wide, clinical laboratory test results into physician audit-and-feedback reports to improve diabetes care in Ontario. Significant variation observed in the aggregated, physician-level proportions of diabetes patients testing above clinical thresholds for HbA1C and LDL highlights the importance of reporting such information to physicians.
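The physician-level aggregation reported in this abstract can be sketched as below. The 9% HbA1C threshold comes from the abstract; the input layout, function name, and sample values are assumptions for illustration, and the multilevel regression is not reproduced.

```python
from collections import defaultdict
from statistics import quantiles

HBA1C_THRESHOLD = 9.0  # percent, the threshold stated in the abstract


def physician_proportions(results):
    """Proportion of each physician's patients with any HbA1C test above threshold.

    `results` is an iterable of (physician_id, patient_id, hba1c_value) rows.
    """
    exceeded = defaultdict(dict)
    for physician, patient, value in results:
        already = exceeded[physician].get(patient, False)
        exceeded[physician][patient] = already or value > HBA1C_THRESHOLD
    return {doc: sum(flags.values()) / len(flags) for doc, flags in exceeded.items()}


# Made-up test results for three physicians.
rows = [
    ("d1", "p1", 9.5), ("d1", "p1", 8.0), ("d1", "p2", 7.0),
    ("d2", "p3", 6.5),
    ("d3", "p4", 10.1), ("d3", "p5", 9.2),
]
proportions = physician_proportions(rows)

# Quartiles of the physician-level proportions give the median and IQR.
q1, median, q3 = quantiles(proportions.values(), n=4)
```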
Introduction: The Ontario Brain Institute (OBI) has developed Brain-CODE, an informatics platform, to support the acquisition, storage, management and analysis of multi-modal data. The standardized research data within Brain-CODE spans several brain disorders, allowing for integrative analyses, while also providing the opportunity to leverage existing clinical administrative data holdings through external linkages.
Objectives and Approach: Within Ontario, the majority of individuals who access the healthcare system have a unique identifier, the Ontario Health Insurance Plan (OHIP) number. The OHIP number can facilitate linkages with administrative data holdings, such as those at the Institute for Clinical Evaluative Sciences (ICES). Given that OBI is not permitted under Ontario's privacy legislation to hold OHIP numbers, identifiers for consented participants are encrypted using a public key mechanism upon entry into Brain-CODE, where the private key is inaccessible. To facilitate linkages involving OHIP numbers between Brain-CODE and ICES, Brain-CODE Link software was co-developed by members of the Indoc Consortium.
Results: Brain-CODE Link allows a deterministic linkage between encrypted identifiers (OHIP numbers) without revealing participant identity. The same homomorphic encryption algorithm applied to identifiers upon entry to Brain-CODE is applied to relevant identifiers within ICES data holdings. Encrypted identifiers from Brain-CODE are securely transferred to ICES, where a comparison computation calculates differences between the encrypted sets. These differences are sent to a semi-trusted third party, who has no access to the original data, to decrypt the differences using the private key. A zero difference indicates a set of matching identifiers. One of the main challenges during testing and development of Brain-CODE Link was ensuring the software was capable of scaling to a population level, performing a large number of comparisons in a computationally efficient manner.
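The idea of comparing identifiers by decrypting only their encrypted difference can be illustrated with a toy additively homomorphic scheme. The sketch below uses a textbook Paillier construction with deliberately tiny keys; the abstract does not say which scheme Brain-CODE Link uses, and the random blinding factor is an added assumption so that the decrypting party learns only whether the difference is zero. The identifiers are made up.

```python
import math
import secrets

# Toy key material built from two well-known Mersenne primes.
# Far too small for real use; shown only to make the arithmetic concrete.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                                          # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private key (third party only)
MU = pow(LAM, -1, N)                               # private key (third party only)


def _random_unit():
    """Random value in [1, N) that is coprime to N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return r


def encrypt(m):
    """Paillier encryption with generator N + 1, using the public key only."""
    return (1 + (m % N) * N) * pow(_random_unit(), N, N2) % N2


def decrypt(c):
    """Paillier decryption; requires the private key."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def blinded_difference(c_a, c_b):
    """Ciphertext of k * (a - b) for a random k, computed without decrypting."""
    diff = c_a * pow(c_b, -1, N2) % N2
    return pow(diff, _random_unit(), N2)


# Comparison step: runs on ciphertexts only.
left = encrypt(1234567890)
same = encrypt(1234567890)
other = encrypt(9876543210)

# Third-party step: decrypts differences; zero means the identifiers match.
is_match = decrypt(blinded_difference(left, same)) == 0
other_is_match = decrypt(blinded_difference(left, other)) == 0
```

Because encryption is randomized, `left` and `same` are different ciphertexts of the same identifier, so equality cannot be read off the ciphertexts directly; only the decrypted difference reveals the match.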
Conclusion/Implications: Ongoing pilot projects within the areas of epilepsy, neurodevelopmental disorders, and neurodegeneration will be the first examples of linkages between Brain-CODE and ICES. Brain-CODE Link has successfully performed several billion test comparisons, indicating its suitability to function as a scalable privacy-preserving record linkage tool to support comprehensive analyses.
Introduction: The Ontario Brain Institute has developed Brain-CODE, an informatics platform designed to support the collection, storage, federation, sharing and analysis of different neuroscience research data types across several brain disorders. Linking such "deep" research data with "broad" health administrative data allows for improved characterization of disorders and supports the development of related health and social policies (Anderson et al., 2015). A privacy-preserving record linkage protocol, developed through the Indoc Consortium, has been used to facilitate such linkages between Brain-CODE and administrative data holdings at the Institute for Clinical Evaluative Sciences (ICES; e.g., emergency department use, inpatient records, prescription drug utilization) (Gee et al., 2018).
Objectives and Approach: Three linkage pilots in the areas of neurodevelopmental disorders, epilepsy, and stroke research have been completed with match rates of >99% across all projects. However, each of these projects required a significant amount of human and computational resources to complete. With other similar data linkages being planned, it was determined that a more permanent solution was required rather than completing linkages on a project-by-project basis. The governance and technical elements needed to create and maintain such a crosswalk between Brain-CODE and ICES were reviewed, and an implementation plan was subsequently developed.
Results: A methodology for creating a crosswalk between Brain-CODE and ICES has been established. The same privacy-preserving record linkage protocol as used in the previous linkage pilots will support the creation of this crosswalk. A plan has been established to update this crosswalk annually to account for new study participants on Brain-CODE.
Conclusion / Implications: The creation of this crosswalk will allow for a more streamlined approach to data linkage between Brain-CODE and ICES. Such an approach can significantly reduce overall resourcing requirements, enable more efficient data linkages, and contribute to the coupling of "broad" and "deep" data.
Introduction: Research data combined with administrative data provides a robust resource capable of answering unique research questions. However, in cases where personal health data are encrypted, due to ethics requirements or institutional restrictions, traditional methods of deterministic and probabilistic record linkage are not feasible. Instead, privacy-preserving record linkages must be used to protect patients' personal data during data linkage.
Objectives: To determine the feasibility and validity of a deterministic privacy-preserving data linkage protocol using homomorphically encrypted data.
Methods: Feasibility was measured by the number of records that successfully matched via direct identifiers. Validity was measured by the number of records that matched with multiple indirect identifiers. The thresholds for feasibility and validity were both set at 95%. The datasets shared a single direct identifier (health card number) and multiple indirect identifiers (sex and date of birth). Direct identifiers were encrypted in both datasets and then transferred to a third-party server capable of linking the encrypted identifiers without decrypting individual records. Once linked, the study team used indirect identifiers to verify the accuracy of the linkage in the final dataset.
Results: With a combination of manual and automated data transfer in a sample of 8,128 individuals, the privacy-preserving data linkage took 36 days to match to a population sample of over 3.2 million records. 99.9% of the records were successfully matched with direct identifiers, and 99.8% successfully matched with multiple indirect identifiers. We deemed the linkage both feasible and valid.
Conclusions: As combining administrative and research data becomes increasingly common, it is imperative to understand options for linking data when direct linkage is not feasible. The current linkage process ensured the privacy and security of patient data and improved data quality. While the initial implementations required significant computational and human resources, increased automation keeps the requirements within feasible bounds.
Introduction: Supporting standardized approaches to common tasks is an important component of quality research using linked administrative data. Standard concept definitions and classifications are vital for ensuring accuracy and consistency in definitions between projects, and improving efficiency and quality. Other leading organizations have published online standard definitions of concepts and classifications.
Objectives and Approach: We developed a comprehensive concept dictionary using a standardized definition template of key components including data sources, codes, scale or range of values, validation details, limitations, SAS code and formats, related concepts, and MeSH terms. A web-based application (built on the Microsoft SharePoint platform) was developed to offer the latest web content authoring capabilities, and advanced search mechanisms enabling the user to search concepts by MeSH terms and key words. It also allowed for navigating concepts through category navigation including clickable categories and sub-categories. Entries will be reviewed annually to ensure the content remains up-to-date.
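The definition template and search described above could be modelled roughly as follows. The field list follows the components named in the abstract; the class name, the search function, and the sample entries (including their codes and MeSH tags) are illustrative assumptions and do not reflect the SharePoint implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    """One concept dictionary entry, following the template components."""
    title: str
    data_sources: List[str]
    codes: List[str]
    value_range: str = ""
    validation_details: str = ""
    limitations: str = ""
    sas_code: str = ""
    related_concepts: List[str] = field(default_factory=list)
    mesh_terms: List[str] = field(default_factory=list)


def search(concepts: List[Concept], mesh_term: Optional[str] = None,
           keyword: Optional[str] = None) -> List[Concept]:
    """Return concepts matching a MeSH term and/or a title keyword (case-insensitive)."""
    hits = []
    for concept in concepts:
        if mesh_term and mesh_term.lower() not in [t.lower() for t in concept.mesh_terms]:
            continue
        if keyword and keyword.lower() not in concept.title.lower():
            continue
        hits.append(concept)
    return hits


# Placeholder entries; data sources and codes are not real values.
dictionary = [
    Concept("Injuries", ["SOURCE-A"], ["CODE-1"], mesh_terms=["Wounds and Injuries"]),
    Concept("Annual physical exams", ["SOURCE-B"], ["CODE-2"], mesh_terms=["Physical Examination"]),
]
by_mesh = search(dictionary, mesh_term="physical examination")
by_keyword = search(dictionary, keyword="injur")
```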
Results: To date, ten concepts, with accompanying codes, have been published on the concept dictionary with another ten currently undergoing editorial review. These concepts span a variety of topics such as injuries, mental health and addictions-related outpatient services, and annual physical exams. New concepts written by content experts and reviewed by an editorial committee will be added on an ongoing basis; thirty concepts are currently under development.
Conclusion/Implications: Development of a concept dictionary provides standardized definitions, algorithms and codes to ensure consistency and quality of research and analysis across multiple projects. Future aims include expansion of the internal organizational site to an external site through collaboration with key stakeholders.
Introduction: Health care systems have faced unprecedented challenges due to the COVID-19 pandemic. Access to timely population-based data has been vital to informing public health policy and practice.
Methods: We describe how ICES, an independent not-for-profit research and analytic institute in Ontario, Canada, pivoted existing research infrastructure and engaged health system stakeholders to provide near real-time population-based data and analytics to support Ontario's COVID-19 pandemic response.
Results: Since April 2020, ICES provided the Ontario COVID-19 Provincial Command Table and public health partners with regular and ad hoc reports on SARS-CoV-2 testing and COVID-19 vaccine coverage. These reports: 1) helped identify congregate care/shared living settings that needed testing and prevention efforts early in the pandemic; 2) provided early indications of inequities in testing and infection in marginalized neighbourhoods, including areas with higher proportions of immigrants and visible minorities; 3) identified areas with high test positivity, which helped Public Health Units target and evaluate prevention efforts; and 4) contributed to altering the province's COVID-19 vaccine roll-out strategy to target high-risk neighbourhoods and helping Public Health Units and community organizations plan local vaccination programs. In addition, ICES is a key component of the Ontario Health Data Platform, which provides scientists with data access to conduct COVID-19 research and analyses.
Discussion and Conclusion: ICES was well-positioned to provide rapid analyses for decision-makers to respond to the evolving public health emergency, and continues to contribute to Ontario's pandemic response by providing timely, relevant reports to health system stakeholders and facilitating data access for externally-funded COVID-19 research.
Background: The linkage of records across administrative databases has become a powerful tool to increase information available to undertake research and analytics in a privacy protective manner.
Objective: The objective of this paper was to describe the data integration strategy used to link the Ontario Ministry of Children, Community and Social Services (MCCSS)-Social Assistance (SA) database with administrative health care data.
Methods: Deterministic and probabilistic linkage methods were used to link the MCCSS-SA database (2003-2016) to the Registered Persons Database, a population registry containing data on all individuals issued a health card number in Ontario, Canada. Linkage rates were estimated, and the degree of record linkage and representativeness of the dataset were evaluated by comparing socio-demographic characteristics of linked and unlinked records.
Results: There were a total of 2,736,353 unique member IDs in the MCCSS-SA database from 1 January 2003 to 31 December 2016; 331,238 (12.1%) were unlinked (linkage rate = 87.9%). Of the 16 passes, the 2 deterministic passes linked 76.2% of records and the 14 probabilistic passes linked a further 11.7%. Linked and unlinked samples were similar for most socio-demographic characteristics (i.e., sex, age, rural dwelling), except migrant status (non-migrant versus migrant) (standardized difference of 0.52). Linked and unlinked records were also different for SA program-specific characteristics, such as social assistance program, Ontario Works and Ontario Disability Support Program (standardized difference of 0.20 for each), data entry system, Service Delivery Model Technology only and both Service Delivery Model Technology and Social Assistance Management System (standardized difference of 0.53 and 0.52, respectively), and months on social assistance (standardized difference of 0.43).
Conclusions: Additional techniques to account for sub-optimal linkage rates may be required to address potential biases resulting from this data linkage. Nonetheless, the linkage between administrative social assistance and health care data will provide important findings on the social determinants of health.
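For readers unfamiliar with the standardized differences reported in this abstract, the sketch below shows the usual formula for a binary characteristic: the absolute difference in proportions divided by the square root of the average of the two binomial variances. The abstract does not state which formula was used, and the proportions in the example are made up.

```python
import math


def standardized_difference(p_linked: float, p_unlinked: float) -> float:
    """Standardized difference between two proportions (binary characteristic)."""
    pooled_variance = (p_linked * (1 - p_linked) + p_unlinked * (1 - p_unlinked)) / 2
    return abs(p_linked - p_unlinked) / math.sqrt(pooled_variance)


# Hypothetical example: 40% of linked vs 50% of unlinked records share a trait.
d = standardized_difference(0.40, 0.50)
```

Unlike a p-value, this measure does not grow with sample size, which is why it is a common choice for comparing linked and unlinked records in very large files.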