Record Linkage and Knowledge Discovery



The Knowledge Discovery in Databases (KDD) process consists of the following phases:

  1. Data cleaning (removal of formal and logical errors and outliers)
  2. Data integration (combination of different sources)
  3. Data selection (extraction of data for analysis from the database)
  4. Data transformation (manipulation of data into forms suitable for data mining)
  5. Data mining (application of analytical methods to synthesize significant relationships)
  6. Relationship evaluation (classification of relationships in terms of utility)
  7. Knowledge presentation (visualization and synthesis of useful relationships)
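
As a rough illustration, the seven phases can be sketched as a chain of small functions. This is only a toy example using the Python standard library: the records, the field names (`id`, `age`, `item`), the age bounds and the `min_support` threshold are all assumptions made for the sketch, not part of any real dataset or standard method.

```python
from collections import Counter

# Hypothetical toy data: two sources describing purchases (all values are made up).
source_a = [
    {"id": 1, "age": 25, "item": "book"},
    {"id": 2, "age": None, "item": "pen"},    # missing value
    {"id": 3, "age": 230, "item": "book"},    # implausible age (outlier)
]
source_b = [
    {"id": 4, "age": 31, "item": "book"},
    {"id": 5, "age": 67, "item": "pen"},
    {"id": 6, "age": 28, "item": "book"},
]

def clean(records):
    # 1. Data cleaning: drop records with missing or implausible ages.
    return [r for r in records if r["age"] is not None and 0 <= r["age"] <= 120]

def integrate(*sources):
    # 2. Data integration: combine the sources into one collection.
    return [r for source in sources for r in source]

def select(records):
    # 3. Data selection: keep only the fields needed for the analysis.
    return [(r["age"], r["item"]) for r in records]

def transform(rows):
    # 4. Data transformation: bin ages into two coarse groups.
    return [("under_40" if age < 40 else "40_plus", item) for age, item in rows]

def mine(rows):
    # 5. Data mining: count how often each (age group, item) pair occurs.
    return Counter(rows)

def evaluate(counts, min_support=2):
    # 6. Relationship evaluation: keep pairs seen at least min_support times.
    return {pair: n for pair, n in counts.items() if n >= min_support}

def present(relationships):
    # 7. Knowledge presentation: print a readable summary.
    for (group, item), n in sorted(relationships.items()):
        print(f"{group} -> {item}: {n} records")

data = integrate(clean(source_a), clean(source_b))
relationships = evaluate(mine(transform(select(data))))
present(relationships)
```

In practice each phase is far more elaborate, but the order of the calls mirrors the order of the list: each step consumes the output of the previous one.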

In this context, it is useful to define a key operation that precedes Knowledge Discovery and that statisticians often have to deal with: Record Linkage.
Record Linkage is an operation that joins multiple datasets in order to obtain richer information. The objective is to identify records that refer to the same individual but are located in different files, using common keys that do not correspond perfectly.

Input: two data sets that observe overlapping groups of units.
Problem: lack of a unique, error-free identification code.
Solution: use of a set of variables that can (jointly) identify records.
Caution: the variables may themselves have "problems" such as errors or missing values, and no single variable identifies a unit uniquely.
Objective: maximize the number of correct matches and minimize the number of incorrect matches.
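
The scheme above can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not a reference implementation: the two files, the choice of key variables (`name`, `birth_year`, `city`), the equal weighting of the three variables and the `0.85` threshold are all hypothetical. String similarity is computed with `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher

# Hypothetical files describing overlapping people; no shared ID exists.
file_a = [
    {"name": "Maria Rossi", "birth_year": 1980, "city": "Milano"},
    {"name": "Luca Bianchi", "birth_year": 1975, "city": "Roma"},
]
file_b = [
    {"name": "Maria Rosi", "birth_year": 1980, "city": "Milano"},   # typo in surname
    {"name": "Paolo Verdi", "birth_year": 1990, "city": "Torino"},
]

def similarity(a, b):
    # String similarity in [0, 1]; 1.0 means identical after lowercasing.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    # Average agreement over the key variables, used jointly as an identifier.
    name = similarity(rec_a["name"], rec_b["name"])
    city = similarity(rec_a["city"], rec_b["city"])
    year = 1.0 if rec_a["birth_year"] == rec_b["birth_year"] else 0.0
    return (name + city + year) / 3

def link(records_a, records_b, threshold=0.85):
    # Compare every pair and keep those whose score reaches the threshold.
    links = []
    for i, rec_a in enumerate(records_a):
        for j, rec_b in enumerate(records_b):
            score = match_score(rec_a, rec_b)
            if score >= threshold:
                links.append((i, j, round(score, 2)))
    return links

links = link(file_a, file_b)
for i, j, score in links:
    print(file_a[i]["name"], "<->", file_b[j]["name"], "score:", score)
```

The threshold controls the trade-off stated in the objective: lowering it recovers more true matches that contain errors but also admits more incorrect ones, while raising it does the opposite. Comparing every pair is only feasible for small files.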


There are three main types of Record Linkage: