Matching two records
Creating a master record is an iterative process: each new master record can be used to find new duplicates.
You can choose between two different algorithms to create master records:
- Simple VSR Matcher
- T-Swoosh. This algorithm is only available in the Standard component.
The main difference between the two algorithms is that T-Swoosh creates, for each master record, a new record that does not exist in the list of input records.
Matching measures
You can compare two records on several attributes. For two records to match, both of the following conditions must hold:
- When using the T-Swoosh algorithm, the score for each matching function in the match rule must exceed the threshold, if one is specified. By default, the threshold is set to 1. This means an exact match for most matching functions, except for Exact - ignore case and potentially any custom matching function.
- The global score, computed as a weighted average of the scores of the different matching functions, must exceed the match threshold. The score is equal to Σ(wi × si(r1,r2)) / Σwi, where wi is the confidence weight of the matching function i and si(r1,r2) is the score of the matching function i over records r1 and r2.
[Figure: Configuration of the tMatchGroup component.]
In this example, the score for the Jaro-Winkler metric on the fname attribute must exceed 0.7 and the global score, with a confidence weight of 1 on each of the two attributes, must exceed 0.85.
[Figure: Example of the weighted average computation.]
- As the Confidence Weight of both attributes is set to 1, the normalized weight of each attribute is 0.5.
- The attribute matching distance is 1 for the lname attribute and 0.722... for the fname attribute.
- The score is calculated as follows: 0.5 × 1 + 0.5 × 0.722... = 0.8611...
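The two conditions and the worked example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the component's implementation: the function names global_score and records_match are hypothetical, the fname score is truncated to 0.7222, and "exceed" is assumed to mean greater than or equal to, since a default threshold of 1 could otherwise never be met.

```python
def global_score(scores, weights):
    """Weighted average: sum(w_i * s_i) / sum(w_i)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def records_match(scores, weights, match_threshold, attribute_thresholds=None):
    """Return True when both conditions hold (hypothetical helper).

    attribute_thresholds holds one entry per matching function;
    None means no threshold is specified for that function.
    Assumption: a score equal to the threshold counts as a match.
    """
    if attribute_thresholds is not None:
        for score, threshold in zip(scores, attribute_thresholds):
            if threshold is not None and score < threshold:
                return False
    return global_score(scores, weights) >= match_threshold


# Values from the example: lname scores 1, fname scores about 0.7222,
# both confidence weights are 1, fname threshold 0.7, match threshold 0.85.
scores = [1.0, 0.7222]
weights = [1, 1]
print(round(global_score(scores, weights), 4))
print(records_match(scores, weights, 0.85, attribute_thresholds=[None, 0.7]))
```

With these inputs the global score is about 0.8611, which is above 0.85, and the fname score is above 0.7, so the two records are treated as a match.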
Match rules
Two records match if at least one of the match rules is satisfied. As soon as two records match according to a given rule, the other rules are not checked.
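The evaluation order described above behaves like a short-circuit "any" over the rules. The sketch below is a hypothetical illustration in Python: first_matching_rule and the two sample rules are invented for this example, and each rule is modeled as a plain function that takes two records and returns True or False.

```python
def first_matching_rule(record1, record2, rules):
    """Return the index of the first satisfied rule, or None if none match.

    Rules are checked in order; evaluation stops at the first rule
    that is satisfied, so later rules are never called.
    """
    for index, rule in enumerate(rules):
        if rule(record1, record2):
            return index
    return None


# Hypothetical rules, for illustration only.
def same_email(a, b):
    return a["email"].lower() == b["email"].lower()


def same_full_name(a, b):
    return (a["fname"], a["lname"]) == (b["fname"], b["lname"])


r1 = {"fname": "John", "lname": "Smith", "email": "JSmith@example.com"}
r2 = {"fname": "Jon", "lname": "Smith", "email": "jsmith@example.com"}

matched = first_matching_rule(r1, r2, [same_email, same_full_name])
print("match" if matched is not None else "no match", matched)
```

Here the first rule is satisfied, so the second rule is not evaluated and the records are considered a match.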