Skip to main content Skip to complementary content

Creating a custom matching algorithm

The tRecordMatching component enables you to use a user-defined matching algorithm for obtaining the results you need.

A custom matching algorithm is written manually and stored in a .jar file (Java archive). Talend provides an example .jar file on the basis of which you are supposed to develop your own file easily.

Procedure

  1. In Eclipse, check out the test.mydistance project from svn at:
  2. In this project, navigate to the Java class named MyDistance.Java: https://github.com/Talend/tdq-studio-se/tree/master/sample/test.mydistance/src/main/java/org/talend/mydistance.
  3. Open this file that has the below code:
    package org.talend.mydistance;
    
    import org.talend.dataquality.record.linkage.attribute.AbstractAttributeMatcher;
    import org.talend.dataquality.record.linkage.constant.AttributeMatcherType;
    
    /**
     * @author scorreia
     * 
     * Example of Matching distance.
     */
    public class MyDistance extends AbstractAttributeMatcher {
    
        /*
         * (non-Javadoc)
         * 
         * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchType()
         */
        @Override
        public AttributeMatcherType getMatchType() {
            // a custom implementation should return this type AttributeMatcherType.custom
            return AttributeMatcherType.CUSTOM;
        }
    
        /*
         * (non-Javadoc)
         * 
         * @see org.talend.dataquality.record.linkage.attribute.IAttributeMatcher#getMatchingWeight(java.lang.String,
         * java.lang.String)
         */
        @Override
        public double getWeight(String arg0, String arg1) {
            // Here goes the custom implementation of the matching distance between the two given strings.
            // the algorithm should return a value between 0 and 1.
    
            // in this example, we consider that 2 strings match if their first 4 characters are identical
            // the arguments are not null (the check for nullity is done by the caller)
            int MAX_CHAR = 4;
            final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));
            int nbIdenticalChar = 0;
            for (; nbIdenticalChar < max; nbIdenticalChar++) {
                if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {
                    break;
                }
            }
            if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {
                MAX_CHAR = Math.max(arg0.length(), arg1.length());
            }
            return (nbIdenticalChar) / ((double) MAX_CHAR);
    }
  4. In this file, type in the class name for the custom algorithm you are creating in order to replace the default name. The default name is MyDistance and you can find it in the line: public class MyDistance implements IAttributeMatcher.
  5. In the place where the default algorithm is in the file, type in the algorithm you need to create to replace the default one. The default algorithm reads as follows:
            int MAX_CHAR = 4;
            final int max = Math.min(MAX_CHAR, Math.min(arg0.length(), arg1.length()));
            int nbIdenticalChar = 0;
            for (; nbIdenticalChar < max; nbIdenticalChar++) {
                if (arg0.charAt(nbIdenticalChar) != arg1.charAt(nbIdenticalChar)) {
                    break;
                }
            }
            if (arg0.length() < MAX_CHAR && arg1.length() < MAX_CHAR) {
                MAX_CHAR = Math.max(arg0.length(), arg1.length());
            }
            return (nbIdenticalChar) / ((double) MAX_CHAR);
    
  6. Save your modifications.
  7. Using Eclipse, export this new .jar file.

Results

Then this user-defined algorithm is ready to be used by the tRecordMatching component.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!