This work was conducted and funded by the Bureau of Labor Statistics. The purpose of the linkage was to enable data available from OSHA’s Injury Tracking Application (ITA) to be used to complete items on the Survey of Occupational Injuries and Illnesses (SOII). This would reduce respondent burden for SOII participants and possibly improve the accuracy of the derived data elements.
Preparation for the linkage involved evaluating and standardizing data elements shared across the files. In particular, this required geocoding the address data, which were key to enabling the linkage. Because this geocoding was done on BLS servers, we used a self-developed geocoding routine that works by comparing addresses to the U.S. Census Bureau’s TIGER database.
Another critical element of this analysis is the comparison of organization names. Because the same name can be transcribed in several ways in a database, an effective comparison required standardizing the name elements: consistent capitalization and punctuation, standardized abbreviations, and removal of common tokens (such as prepositions and the articles ‘a’, ‘an’, ‘the’). Additionally, because certain tokens occur fairly commonly in organization names, comparisons are improved by weighting each token by the inverse of its frequency, so that rarer tokens contribute more to the comparison. For this project, NORC developed the methodology and coding to make effective comparisons of organization names.
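The standardization and weighting steps described above can be illustrated with a minimal Python sketch. It is not the methodology developed for this project. The abbreviation map, the stop-token list, the smoothed inverse-document-frequency formula, and the weighted Jaccard similarity are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation map and stop-token list, for illustration only.
ABBREVIATIONS = {"incorporated": "inc", "corporation": "corp",
                 "company": "co", "limited": "ltd"}
STOP_TOKENS = {"a", "an", "the", "of", "for", "and"}


def normalize(name):
    """Lowercase, strip punctuation, standardize abbreviations, drop stop tokens."""
    text = re.sub(r"[.']", "", name.lower())    # "Corp." -> "corp"
    tokens = re.sub(r"[^a-z0-9]+", " ", text).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_TOKENS]


def token_weights(names):
    """Smoothed inverse document frequency: rarer tokens get larger weights."""
    n = len(names)
    doc_freq = Counter(t for name in names for t in set(normalize(name)))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def _total(tokens, weights, default):
    """Sum of weights for a set of tokens; unseen tokens get the default weight."""
    return sum(weights.get(t, default) for t in tokens)


def weighted_similarity(a, b, weights, default=1.0):
    """Weighted Jaccard overlap of the two names' token sets, between 0 and 1."""
    ta, tb = set(normalize(a)), set(normalize(b))
    if not ta or not tb:
        return 0.0
    return _total(ta & tb, weights, default) / _total(ta | tb, weights, default)
```

With weights of this kind, two names that share only frequent tokens such as "services" or "inc" score lower than an unweighted token overlap would suggest, while names that share a rare token score higher.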
Integrating the results of geocoding and organization name comparisons, the actual record linkage process was built on the Fellegi-Sunter paradigm. M and U probabilities (the proportion of field agreement among matched and unmatched pairs, respectively) were estimated using a custom-designed machine learning routine that minimized the distance between the expected and actual frequencies of each agreement pattern (i.e., each combination of agreement outcomes across the comparison variables). The result of this linkage process was that each pair was assigned an estimated match probability. This analysis was conducted over several blocking passes and the results for each pair were summarized. Those pairs with a sufficiently high estimated match probability were retained for the delivered database.
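To make the Fellegi-Sunter quantities concrete, the following Python sketch computes a match probability for an agreement pattern and a distance between observed and expected pattern frequencies. It is a simplified illustration, not the project's routine. It assumes binary agree/disagree comparisons, conditional independence of fields within the matched and unmatched classes, and a squared-error distance. The function names and the `prior` parameter (the assumed share of true matches among compared pairs) are hypothetical.

```python
from itertools import product


def pattern_likelihood(pattern, probs):
    """P(pattern | class), assuming fields agree independently within the class."""
    out = 1.0
    for agrees, p in zip(pattern, probs):
        out *= p if agrees else (1.0 - p)
    return out


def match_probability(pattern, m, u, prior):
    """Posterior probability that a pair with this agreement pattern is a match."""
    pm = prior * pattern_likelihood(pattern, m)
    pu = (1.0 - prior) * pattern_likelihood(pattern, u)
    return pm / (pm + pu)


def expected_counts(m, u, prior, n_pairs):
    """Expected number of pairs per agreement pattern under the two-class mixture."""
    return {
        pat: n_pairs * (prior * pattern_likelihood(pat, m)
                        + (1.0 - prior) * pattern_likelihood(pat, u))
        for pat in product((0, 1), repeat=len(m))
    }


def squared_distance(observed, m, u, prior):
    """Sum of squared gaps between observed and expected pattern counts."""
    n_pairs = sum(observed.values())
    expected = expected_counts(m, u, prior, n_pairs)
    return sum((observed.get(pat, 0) - exp) ** 2 for pat, exp in expected.items())
```

An estimation routine of the kind described would search over `m`, `u`, and `prior` for the values that make a distance such as `squared_distance` as small as possible. The optimizer itself is omitted here, since its design is not specified above.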
For this project we developed an approach to measure the similarity of organization names.
We used a self-developed method and coding to evaluate links based on agreement patterns, so as to get the best fit to the frequency of comparison agreement values and assign a probability of match validity. This model allowed interactions among comparison variables to be taken into account in the probability model.
We used a modified version of SAS's PROC GEOCODE to standardize addresses for use in linkage.
The results of the linkage analysis included a data file that showed, for each SOII record, the likely or potential matching record from ITA (i.e., the record most likely to be the true match). With each matched pair on the file, we provided an estimated probability of true match status (i.e., that the records represent the same organization). The expectation is that BLS will use this file by accepting all links above a probability threshold of its choosing as pairs highly likely to be matches.
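The intended use of the file can be sketched in Python as follows. This is an illustration under assumed conventions, not a description of the delivered file's layout. The tuple structure `(soii_id, ita_id, probability)`, the function names, and the threshold value in the usage example are all hypothetical.

```python
def best_candidates(scored_pairs):
    """For each SOII record, keep the ITA candidate with the highest probability."""
    best = {}
    for soii_id, ita_id, prob in scored_pairs:
        if soii_id not in best or prob > best[soii_id][1]:
            best[soii_id] = (ita_id, prob)
    return best


def accept_links(best, threshold):
    """Accept only the links whose estimated match probability meets the threshold."""
    return {soii_id: ita_id
            for soii_id, (ita_id, prob) in best.items()
            if prob >= threshold}


# Example with made-up identifiers and probabilities.
pairs = [("S1", "I7", 0.97), ("S1", "I9", 0.40), ("S2", "I3", 0.62)]
links = accept_links(best_candidates(pairs), threshold=0.90)
```

Raising the threshold trades fewer accepted links for a lower expected rate of false matches, so the choice depends on how the linked data will be used.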
Documentation of this data file and a report summarizing the process of developing it were delivered to the client: BLS.