The Knowledge Discovery in Databases (KDD) process consists of the following phases:
- Data cleaning (removal of formal and logical errors and outliers)
- Data integration (combination of different sources)
- Data selection (extraction of data for analysis from the database)
- Data transformation (manipulation of data into forms suitable for data mining)
- Data mining (application of analytical methods to synthesize significant relationships)
- Relationship evaluation (classification of relationships in terms of utility)
- Knowledge presentation (visualization and synthesis of useful relationships)
Before Knowledge Discovery can begin, statisticians often face a key preliminary operation: Record Linkage.
Record Linkage is an operation that joins multiple datasets in order to enrich the available information. The objective is to identify records that refer to the same individual but are located in different files, using common keys that do not correspond perfectly.
Input: two datasets that observe overlapping groups of units.
Problem: lack of a unique, error-free identification code.
Solution: use of a set of variables that are jointly capable of identifying records.
Caution: each variable can have "problems" (no single variable is a unique identifier).
Objective: maximize the number of correct matches while minimizing the number of incorrect matches.
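A minimal illustration of the problem, with two assumed toy datasets: the files observe overlapping units, there is no shared error-free ID, and an exact join on the name fails because the keys do not correspond perfectly.

```python
# Two toy datasets observing overlapping units, with no shared error-free ID.
file_a = [{"name": "Maria Rossi", "year": 1980},
          {"name": "Luca Bianchi", "year": 1975}]
file_b = [{"name": "Maria  Rossi", "year": 1980},   # extra space: a typo
          {"name": "Anna Verdi", "year": 1990}]

# A naive exact join misses the one true match because of the typo.
exact_matches = [(a, b) for a in file_a for b in file_b
                 if a["name"] == b["name"]]
print(len(exact_matches))  # 0 pairs found, despite one true match
```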
There are three main types of Record Linkage:
- Merge by matching
- Based on sorting files to be matched according to a common identification key
- It is very efficient
- It is sensitive to errors in the identification key (if the key is not unique, it is better not to use this method)
- It is advisable when the files to be matched belong to the same information system
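The merge-by-matching idea can be sketched as a classic sort-merge join: both files are sorted on the shared identification key and then scanned in parallel. The data below is illustrative.

```python
# Merge by matching: sort both files on a common identification key,
# then scan them in parallel (sort-merge join). Illustrative data only.
file_a = sorted([("ID3", "Rossi"), ("ID1", "Bianchi"), ("ID2", "Verdi")])
file_b = sorted([("ID1", 1975), ("ID2", 1990), ("ID4", 1982)])

matches = []
i = j = 0
while i < len(file_a) and j < len(file_b):
    if file_a[i][0] == file_b[j][0]:      # same key: link the records
        matches.append((file_a[i][0], file_a[i][1], file_b[j][1]))
        i += 1
        j += 1
    elif file_a[i][0] < file_b[j][0]:     # advance whichever file lags
        i += 1
    else:
        j += 1

print(matches)  # [('ID1', 'Bianchi', 1975), ('ID2', 'Verdi', 1990)]
```

This single parallel scan is why the method is very efficient, and also why an error in the key silently loses the match.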
- Deterministic matching
- Based on the concordance of a sufficient number of common variables; it can take missing values and errors in the matching variables into account
- Allows the informative power of each variable to be graded through scores:
- Same name = 2 points
- Same surname = 7 points
- Same year of birth = 3 points
- Scores can be established through statistical analysis on external data
- All choices on the comparison criterion are external to the treated data
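The scoring rule above can be sketched directly, using the scores from the text (name = 2, surname = 7, year of birth = 3); the threshold of 9 is an assumption for illustration, standing in for a value established on external data.

```python
# Deterministic matching with graded scores (values from the text:
# name=2, surname=7, year of birth=3). Threshold is assumed for illustration.
SCORES = {"name": 2, "surname": 7, "year": 3}
THRESHOLD = 9

def match_score(rec_a, rec_b):
    """Sum the scores of the fields on which the two records agree."""
    return sum(pts for field, pts in SCORES.items()
               if rec_a.get(field) is not None
               and rec_a.get(field) == rec_b.get(field))

a = {"name": "Maria", "surname": "Rossi", "year": 1980}
b = {"name": "Mario", "surname": "Rossi", "year": 1980}  # name differs

score = match_score(a, b)          # surname (7) + year (3) = 10
print(score, score >= THRESHOLD)   # 10 True -> declared a match
```

Note how a missing field simply contributes zero, which is one way the method "takes missing values into account".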
- Probabilistic matching
- As in deterministic matching:
- Work on comparing all possible pairs;
- Use scores based on flexible criteria to establish matches
- But:
- The scores and thresholds used to choose matches depend on the problem under examination
- Disagreement levels in the data are also taken into account
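One common way to make scores depend on the data, in the spirit of probabilistic matching, is a Fellegi-Sunter-style weight: for each variable, m is the probability of agreement among true matches and u among non-matches, and both agreement and disagreement contribute to the score. The m/u values below are assumptions for illustration only.

```python
import math

# Probabilistic-style weights: m = P(agree | match), u = P(agree | non-match).
# These m/u values are illustrative assumptions, not estimates.
mu = {"surname": (0.95, 0.01), "year": (0.90, 0.05)}

def pair_weight(agreements):
    """Add log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
    w = 0.0
    for field, (m, u) in mu.items():
        if agreements[field]:
            w += math.log2(m / u)
        else:
            w += math.log2((1 - m) / (1 - u))
    return w

# Agreement on both fields yields a large positive weight...
print(pair_weight({"surname": True, "year": True}))
# ...while disagreement on the surname pulls the weight down sharply,
# so disagreement levels are taken into account, not just agreements.
print(pair_weight({"surname": False, "year": True}))
```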
The Phases of Record Linkage
Harmonization
Input file preparation (pre-processing):
- Harmonization of unit definitions;
- Harmonization of reference periods;
- Population completion;
- Harmonization of variable definitions;
- Harmonization of classifications;
- Adjustment of measurement errors (probabilistic only);
- Adjustment for non-responses (probabilistic only);
- Construction of derived variables.
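Harmonization of variable definitions often comes down to normalizing the raw values before any comparison. A minimal sketch for names (the function name is illustrative):

```python
import unicodedata

# Harmonization sketch: normalize case, accents, and whitespace so that
# equivalent values in the two files become byte-identical.
def harmonize_name(raw):
    s = raw.strip().lower()
    # Strip accents: decompose characters, then drop combining marks.
    s = "".join(c for c in unicodedata.normalize("NFKD", s)
                if not unicodedata.combining(c))
    return " ".join(s.split())  # collapse internal whitespace

print(harmonize_name("  Nicolò  ROSSI "))  # -> "nicolo rossi"
```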
Selection of Matching Variables
Selection of common identifying attributes (blocking and matching variables):
Desirable characteristics:
- 1. Universal
- 2. Permanent
- 3. Accurate
- 4. Non-sensitive
Choice of Comparison Function(s)
- Editing and parsing: methods to handle typos and variant spellings or pronunciations of foreign names.
- Sorting and blocking: operations to:
- Reduce the number of record pairs the computer must compare.
- Allow statistical operations on the data.
Choice of Decision Model
Deterministic: Fixed rules; manual check (clerical review) for errors.
Probabilistic: Statistical model with "optimal" decision rules and error probabilities.
Decisions for Uncertain Matches
When the data cannot determine whether a pair is a match, use:
- Alternative linkage techniques.
- Manual analysis (clerical review).
Data Quality and Analysis
Evaluate the accuracy of the result and determine if standard estimators can be applied directly.
|                      | Manual review: Matched | Manual review: Not matched |
|----------------------|------------------------|----------------------------|
| Linkage: Matched     | TP                     | FP                         |
| Linkage: Not matched | FN                     | TN                         |
FMR (false match rate) = FP / (TP + FP)
FNMR (false non-match rate) = FN / (FN + TN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
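Computing the four measures from the confusion counts is direct; the counts below are assumed purely for illustration.

```python
# Quality measures from illustrative confusion-matrix counts.
TP, FP, FN, TN = 90, 10, 5, 895  # assumed counts, for illustration only

fmr = FP / (TP + FP)          # share of declared links that are wrong
fnmr = FN / (FN + TN)         # share of non-links that are true matches
sensitivity = TP / (TP + FN)  # true matches correctly linked
specificity = TN / (TN + FP)  # true non-matches correctly not linked

print(fmr, fnmr, sensitivity, specificity)
```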