Data Linkage

Survey data are invaluable for understanding the incidence of social and economic factors and the relationships among them. Nevertheless, there can be significant limitations in the ability to use these data to analyze policy-related questions. Among these are:

  • The cost of fielding a survey
  • Nonresponse rates (which have increased substantially over the last several decades)
  • Measurement error – respondents are not always able to accurately answer the questions posed to them
  • Estimability issues associated with a fixed sample size – it may not be possible to make valid statistical estimates for small areas
  • Lack of necessary detail – it is difficult for surveys to collect detailed data such as claims histories

To a large degree, these issues can be addressed by introducing administrative record data into the analysis set. Linking data from multiple sources, however, requires specialized expertise of the kind NORC has.

NORC has the experience to locate and access data from multiple sources in order to support robust statistical analyses:

  • In-house surveys
  • Public use survey files
    • NORC has the knowledge to identify appropriate public-use data files and understand the specific strengths and shortcomings of each, including the specification of sampling and measurement error.
  • Micro-data records
    • Administrative – Federal and state government
  • Proprietary data
    • Credit bureau
    • Data broker
  • Health Records
    • Enrollment records
    • Claims records
  • Registers of business establishments, providers, and organizations
    • Coding lists – such as ZIP Code to county or NAICS to SIC

NORC has expertise and advanced tools to develop functional analysis files that integrate the data from these multiple sources:

    • Database Design – effectively structuring data coming from multiple sources so that it can be readily used to flexibly answer multiple research questions
    • Record Linkage – a technique that allows records to be matched even when identification fields are incomplete, erroneous, or inconsistent; it is therefore sometimes called fuzzy matching.
      • Editing of the data elements used for matching, such as address standardization, nickname rendering, and correction of misspellings.
      • Access to record linkage software that is used by the Census Bureau to match large survey and administrative record files. This software is available for customization (including state-of-the-art parameter tuning) to user-specified files. Weight tuning is based on unsupervised learning according to the Fellegi-Sunter record linkage paradigm.
      • The ability to evaluate the quality of matches, both false positives and false negatives, and to use this evaluation to further tune the matching so that it yields the best set for analysis. For certain analyses it is critical that the number of false matches be minimized; for others, it is more important that the set of matches be as complete as possible, even if this means accepting some false matches.
    • Statistical Matching – direct record linkage is often not feasible, either because identifying elements (such as names and addresses) are lacking or because few individuals (or other entities) appear in more than one file. In these cases, it is often useful to perform statistical linkages, in which the records brought together, while not belonging to the same person (or entity), are similar enough to use as an analysis set.
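
The record linkage steps described above (editing the matching fields, Fellegi-Sunter weighting, and trading off false matches against missed matches) can be illustrated with a minimal Python sketch. This is not the Census Bureau software mentioned above. The nickname table, the field names, the m- and u-probabilities, and the cutoffs are all hypothetical values chosen for illustration; in practice such parameters would be estimated from the files being linked.

```python
import math

# Hypothetical nickname table used to standardize first names before comparison.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

# Hypothetical m/u probabilities per field (illustrative, not estimated):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
PARAMS = {
    "first": (0.95, 0.01),
    "last": (0.97, 0.005),
    "zip": (0.90, 0.05),
}


def standardize(record):
    """Lowercase and trim each field, then map nicknames to a canonical name."""
    clean = {k: str(v).strip().lower() for k, v in record.items()}
    clean["first"] = NICKNAMES.get(clean["first"], clean["first"])
    return clean


def match_weight(a, b, params=PARAMS):
    """Sum the log2 agreement or disagreement weight of each compared field."""
    a, b = standardize(a), standardize(b)
    total = 0.0
    for field, (m, u) in params.items():
        if a[field] == b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Apply two cutoffs; a higher `upper` accepts fewer false matches."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "clerical review"


survey = {"first": "Bill ", "last": "Smith", "zip": "60637"}
admin = {"first": "William", "last": "SMITH", "zip": "60637"}
other = {"first": "Robert", "last": "Jones", "zip": "60637"}

w_same = match_weight(survey, admin)
w_diff = match_weight(survey, other)
print(classify(w_same), classify(w_diff))
```

Raising the hypothetical `upper` cutoff suits analyses where false matches must be minimized, while lowering it favors completeness at the cost of accepting some false matches.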
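
Statistical matching, as described above, can be sketched as a nearest-neighbor match: each record in one file borrows a variable from the most similar record in another file, based on characteristics the two files share. The Python example below is an illustration under stated assumptions rather than NORC's method; the two small files, the variable names (`age`, `income`, `claims`), and the scale factors in the distance function are hypothetical.

```python
# Recipient file (for example, a survey) lacks `claims`;
# donor file (for example, administrative records) has it.
recipients = [
    {"id": "s1", "age": 34, "income": 52000},
    {"id": "s2", "age": 61, "income": 31000},
]
donors = [
    {"id": "a1", "age": 36, "income": 50000, "claims": 2},
    {"id": "a2", "age": 59, "income": 33000, "claims": 7},
    {"id": "a3", "age": 25, "income": 90000, "claims": 0},
]

# Hypothetical scale factors so that age and income contribute comparably.
SCALES = {"age": 10.0, "income": 10000.0}


def distance(r, d, scales=SCALES):
    """Scaled Euclidean distance over the variables common to both files."""
    return sum(((r[k] - d[k]) / s) ** 2 for k, s in scales.items()) ** 0.5


def statistical_match(recipients, donors, target="claims"):
    """Attach the target variable from the nearest donor to each recipient."""
    matched = []
    for r in recipients:
        nearest = min(donors, key=lambda d: distance(r, d))
        matched.append({**r, target: nearest[target], "donor_id": nearest["id"]})
    return matched


analysis_set = statistical_match(recipients, donors)
for row in analysis_set:
    print(row)
```

In this sketch a donor may be used more than once; whether that is acceptable depends on the analysis.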