Stepwise Variable Selection for Loglinear Mixture in Record Linkage

Abstract:
A model-building strategy is proposed to improve probabilistic matching in record linkage, with a focus on the two-component loglinear mixture model, in which one component describes the matched pairs and the other the unmatched pairs. In practice, comparison attributes (i.e., covariates) often interact with each other, so the loglinear models for both the matched and the unmatched pairs may require interaction terms. However, the interaction patterns are often not the same for the two components. In particular, because the number of matched pairs is usually very small compared with the number of unmatched pairs, the model for matched pairs cannot be fitted with interactions of the same high order as the model for unmatched pairs. The proposed strategy is data-driven and attempts to avoid both the underfitting and the overfitting that can result from subjective model specification. Starting from the model with no interactions, we add interactions sequentially to the two loglinear components using forward selection. Specifically, we define the alternatively climbing pathways through families of two-component mixtures with higher-order interactions. The mixture models expanded along a pathway are successively nested, so conventional tests for comparing nested models can be applied. For parameter estimation, a simplified method for the EM algorithm (including the choice of initial parameter values) is developed, which facilitates fitting the mixture model with existing packages and functions in statistical software such as R. Simulation studies were conducted for various situations to assess the model selection approach, and the selected models were compared with the naive model that assumes field independence.
We applied this strategy to the record linkage case study from the 2006 Annual Meeting of the Statistical Society of Canada (SSC) and identified interactions among certain comparison attributes for both matched and unmatched pairs; these interactions are not always the same for the two mixture components.
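
To make the baseline concrete, the following is a minimal sketch of EM for the naive two-component mixture that assumes field independence, i.e., the starting point with no interactions. It is not the authors' simplified EM procedure and does not include the forward selection of interaction terms. The function name, the initial values, and the simulated agreement patterns are all assumptions made for illustration only.

```python
import random

EPS = 1e-6  # keeps estimated probabilities away from 0 and 1


def clamp(x):
    return min(max(x, EPS), 1.0 - EPS)


def em_independence_mixture(patterns, n_iter=100, p=0.1, m0=0.9, u0=0.1):
    """EM for a two-component mixture of independent binary agreement fields.

    patterns: list of 0/1 tuples, one per record pair (1 = field agrees).
    Returns (p, m, u, weights): the estimated proportion of matched pairs,
    per-field agreement probabilities for matched (m) and unmatched (u)
    pairs, and the posterior match probabilities from the last E-step.
    """
    n, k = len(patterns), len(patterns[0])
    m = [m0] * k  # hypothetical initial values, not taken from the paper
    u = [u0] * k
    weights = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        weights = []
        for g in patterns:
            lm, lu = p, 1.0 - p
            for j, a in enumerate(g):
                lm *= m[j] if a else 1.0 - m[j]
                lu *= u[j] if a else 1.0 - u[j]
            weights.append(lm / (lm + lu))
        # M-step: weighted proportions under field independence.
        total = sum(weights)
        p = clamp(total / n)
        for j in range(k):
            agree_m = sum(w for w, g in zip(weights, patterns) if g[j])
            agree_u = sum(1.0 - w for w, g in zip(weights, patterns) if g[j])
            m[j] = clamp(agree_m / max(total, EPS))
            u[j] = clamp(agree_u / max(n - total, EPS))
    return p, m, u, weights


# Synthetic data for illustration: about 10% matched pairs, four fields.
random.seed(0)
true_m = [0.95, 0.90, 0.90, 0.85]
true_u = [0.05, 0.10, 0.10, 0.20]
pairs = []
for _ in range(2000):
    probs = true_m if random.random() < 0.1 else true_u
    pairs.append(tuple(int(random.random() < q) for q in probs))

p_hat, m_hat, u_hat, weights = em_independence_mixture(pairs)
print(round(p_hat, 3), [round(x, 2) for x in m_hat], [round(x, 2) for x in u_hat])
```

Under the strategy described above, each loglinear component would then be expanded with interaction terms one step at a time, and each expanded mixture would be compared with its nested predecessor using a conventional test such as a likelihood-ratio test.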
Pages: 141-162

