
tGenKey properties for Apache Spark Batch

These properties are used to configure tGenKey running in the Spark Batch Job framework.

The Spark Batch tGenKey component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Click the import icon to import blocking keys from the match rules that are defined and saved in the Studio repository.

When you click the import icon, the Match Rule Selector wizard opens to help you import blocking keys from the match rules listed in the Studio repository and use them in your Job.

You can import blocking keys only from match rules that are defined with the VSR algorithm and saved in the Studio repository. For further information, see Importing match rules from the Studio repository.

Column

Select the column(s) from the main flow on which you want to define certain algorithms to set the functional key.

Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format.

For example, to compare only the year of a date, set the type of the date column to Date in the component schema and enter "yyyy" in the Date Pattern field. The component then converts the date to a string according to the pattern defined in the schema before starting the string comparison.
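
To make this behavior concrete, here is a minimal Java sketch of that conversion, assuming the "yyyy" pattern from the example (the class and variable names are illustrative, not Talend internals):

  import java.text.SimpleDateFormat;
  import java.util.Date;

  public class DatePatternDemo {
      public static void main(String[] args) throws Exception {
          SimpleDateFormat input = new SimpleDateFormat("yyyy-MM-dd");
          Date d1 = input.parse("2020-01-15");
          Date d2 = input.parse("2020-11-30");
          // With Date Pattern "yyyy", only the year survives the
          // Date-to-String conversion, so these two dates compare as equal.
          SimpleDateFormat pattern = new SimpleDateFormat("yyyy");
          System.out.println(pattern.format(d1).equals(pattern.format(d2))); // true
      }
  }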

Pre-Algorithm

If required, select the relevant algorithm from the list:

Remove diacritical marks: removes any diacritical mark.

Remove diacritical marks and lower case: removes any diacritical mark and converts to lower case before generating the code of the column.

Remove diacritical marks and upper case: removes any diacritical mark and converts to upper case before generating the code of the column.

Lower case: converts the field to lower case before applying the key algorithm.

Upper case: converts the field to upper case before applying the key algorithm.

Add left position character: enables you to add a character to the left of the column.

Add right position character: enables you to add a character to the right of the column.
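
As a rough illustration of the diacritics and case options above, here is a minimal Java sketch, assuming standard Unicode normalization (the helper name is ours, not Talend's):

  import java.text.Normalizer;

  public class PreAlgorithmDemo {
      // Sketch of "Remove diacritical marks and lower case": decompose the
      // string (NFD), strip the combining marks, then convert to lower case.
      static String removeDiacriticsLowerCase(String s) {
          String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
          return decomposed.replaceAll("\\p{M}", "").toLowerCase();
      }

      public static void main(String[] args) {
          System.out.println(removeDiacriticsLowerCase("Gödel")); // godel
          // "Add left position character" with value "0" amounts to prefixing:
          System.out.println("0" + "42"); // 042
      }
  }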

Value

Set the algorithm value, where applicable.

Algorithm

Select the relevant algorithm from the list:

First character of each word: includes in the functional key the first character of each word in the column. When using Japanese characters in the input data, the input text should be tokenized. For more information, see tJapaneseTokenize.

N first characters of each word: includes in the functional key the first N characters of each word in the column. When using Japanese characters in the input data, the input text should be tokenized. For more information, see tJapaneseTokenize.

First N characters of the string: includes in the functional key the first N characters of the string.

Last N characters of the string: includes in the functional key the last N characters of the string.

First N consonants of the string: includes in the functional key the first N consonants of the string. Japanese and Chinese characters are not supported.

First N vowels of the string: includes in the functional key the first N vowels of the string. Japanese and Chinese characters are not supported.

Pick characters: includes in the functional key the characters located at fixed positions (corresponding to the set digits/ranges).

Exact: includes in the functional key the full string.

Substring(a,b): includes in the functional key the characters between the set indexes.
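
As an illustration of the character-extraction options above, here is a minimal Java sketch (the helper names are ours and do not reflect the component's internals):

  public class KeyAlgorithmDemo {
      // Sketch of "First character of each word" on whitespace-separated words.
      static String firstCharOfEachWord(String s) {
          StringBuilder key = new StringBuilder();
          for (String word : s.trim().split("\\s+")) {
              if (!word.isEmpty()) key.append(word.charAt(0));
          }
          return key.toString();
      }

      // Sketch of "First N characters of the string".
      static String firstN(String s, int n) {
          return s.substring(0, Math.min(n, s.length()));
      }

      public static void main(String[] args) {
          System.out.println(firstCharOfEachWord("Tom Cruise")); // TC
          System.out.println(firstN("Cruise", 3)); // Cru
      }
  }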

Soundex code: generates a code according to a standard English phonetic algorithm. This code represents the character string that will be included in the functional key. Japanese and Chinese characters are not supported.

Metaphone code: generates a code according to the character pronunciation. This code represents the character string that will be included in the functional key. Japanese and Chinese characters are not supported.

Double-metaphone code: generates a code according to the character pronunciation using a newer version of the Metaphone phonetic algorithm, which produces more accurate results than the original algorithm. This code represents the character string that will be included in the functional key. Japanese and Chinese characters are not supported.
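
The documentation does not state which phonetic implementations the component embeds; purely as an illustration, the equivalent encoders from the Apache Commons Codec library behave as follows:

  import org.apache.commons.codec.language.DoubleMetaphone;
  import org.apache.commons.codec.language.Metaphone;
  import org.apache.commons.codec.language.Soundex;

  public class PhoneticDemo {
      public static void main(String[] args) {
          // Comparable encoders from Apache Commons Codec; not necessarily
          // the exact implementations used by tGenKey.
          System.out.println(new Soundex().soundex("Robert")); // R163
          System.out.println(new Metaphone().metaphone("Smith")); // SM0
          System.out.println(new DoubleMetaphone().doubleMetaphone("Smith")); // SM0
      }
  }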

Fingerprint key: generates the functional key from a string value through the following sequential process:
  1. remove leading and trailing whitespace,

  2. change all characters to their lowercase representation,

  3. remove all punctuation and control characters,

  4. split the string into whitespace-separated tokens,

  5. sort the tokens and remove duplicates,

  6. join the tokens back together,

    Because the string parts are sorted, the original order of the tokens does not matter: Cruise, Tom and Tom Cruise both end up with the fingerprint cruise tom and therefore end up in the same cluster.

  7. normalize extended western characters to their ASCII representation, for example gödel to godel.

    This step reproduces data entry mistakes made when entering extended characters with an ASCII-only keyboard. However, it can also lead to false positives: for example, gödel and godél would both end up with godel as their fingerprint even though they are likely to be different names. So this algorithm might work less effectively for datasets where extended characters play a substantial differentiating role.
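
The sequence above maps directly to simple string operations; here is a minimal Java sketch of it, assuming space-joined tokens (the method name is ours, not Talend's):

  import java.text.Normalizer;
  import java.util.TreeSet;

  public class FingerprintDemo {
      static String fingerprint(String s) {
          String t = s.trim() // 1. remove leading/trailing whitespace
                      .toLowerCase() // 2. lower case
                      .replaceAll("\\p{Punct}|\\p{Cntrl}", ""); // 3. strip punctuation/control characters
          TreeSet<String> tokens = new TreeSet<>(); // 5. sorts and de-duplicates
          for (String token : t.split("\\s+")) { // 4. whitespace-separated tokens
              if (!token.isEmpty()) tokens.add(token);
          }
          String joined = String.join(" ", tokens); // 6. join back together
          return Normalizer.normalize(joined, Normalizer.Form.NFD)
                           .replaceAll("\\p{M}", ""); // 7. fold to ASCII
      }

      public static void main(String[] args) {
          System.out.println(fingerprint("Cruise, Tom")); // cruise tom
          System.out.println(fingerprint("Tom Cruise")); // cruise tom
      }
  }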

nGramkey: this algorithm is similar to the Fingerprint key method described above, but instead of using whitespace-separated tokens, it uses n-grams, where n can be specified by the user. This method generates the functional key from a string value through the following sequential process:
  1. change all characters to their lowercase representation,

  2. remove all punctuation and control characters,

  3. obtain all the string n-grams,

  4. sort the n-grams and remove duplicates,

  5. join the sorted n-grams back together,

  6. normalize extended western characters to their ASCII representation, for example gödel to godel.

    For example, the 2-gram fingerprint of Paris is arispari and the 1-gram fingerprint is aiprs.

    The delivered implementation of this algorithm uses 2-grams.

Note:

If the column on which you want to use the nGramkey algorithm can contain values of only 0 or 1 characters, you must filter out this data before generating the functional key. This way you avoid comparing records that should not be match candidates.
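
Here is a minimal Java sketch of the n-gram variant, reproducing the Paris example above (the method name is ours; removing whitespace before building the n-grams is our assumption):

  import java.text.Normalizer;
  import java.util.TreeSet;

  public class NGramKeyDemo {
      static String nGramKey(String s, int n) {
          // 1-2. lower case, strip punctuation/control characters (and, as an
          // assumption here, whitespace).
          String t = s.toLowerCase().replaceAll("\\p{Punct}|\\p{Cntrl}|\\s", "");
          TreeSet<String> grams = new TreeSet<>(); // 4. sorts and de-duplicates
          for (int i = 0; i + n <= t.length(); i++) { // 3. obtain all n-grams
              grams.add(t.substring(i, i + n));
          }
          String joined = String.join("", grams); // 5. join back together
          return Normalizer.normalize(joined, Normalizer.Form.NFD)
                           .replaceAll("\\p{M}", ""); // 6. fold to ASCII
      }

      public static void main(String[] args) {
          System.out.println(nGramKey("Paris", 2)); // arispari
          System.out.println(nGramKey("Paris", 1)); // aiprs
      }
  }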

Cologne phonetics: a Soundex phonetic algorithm optimized for the German language. It encodes a string into a Cologne phonetic value. This code represents the character string that will be included in the functional key. Japanese and Chinese characters are not supported.
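
Again for illustration only, the Cologne phonetic encoder shipped with Apache Commons Codec (not necessarily the implementation embedded in tGenKey) maps similar-sounding German names to the same code:

  import org.apache.commons.codec.language.ColognePhonetic;

  public class ColognePhoneticDemo {
      public static void main(String[] args) {
          ColognePhonetic cologne = new ColognePhonetic();
          // Both spellings encode to the same Cologne phonetic value, 657.
          System.out.println(cologne.colognePhonetic("Müller")); // 657
          System.out.println(cologne.colognePhonetic("Mueller")); // 657
      }
  }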

Value

Set the algorithm value, where applicable.

If you do not set a value for an algorithm that requires one, the Job fails with a compilation error.

Post-Algorithm

If required, select the relevant algorithm from the list:

Use default value (string): enables you to choose a string to replace null or empty data.

Add left position character: enables you to add a character to the left of the column.

Add right position character: enables you to add a character to the right of the column.
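
A minimal Java sketch of what these post-processing options amount to (the default string and padding character are arbitrary illustration values):

  public class PostAlgorithmDemo {
      public static void main(String[] args) {
          // "Use default value (string)": replace null or empty data.
          String input = "";
          String value = (input == null || input.isEmpty()) ? "UNKNOWN" : input;
          // "Add right position character" with value "#" appends a character:
          System.out.println(value + "#"); // UNKNOWN#
      }
  }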

Value

Set the option value, where applicable.

Show help

Select this check box to display instructions on how to set the algorithm and option parameters.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job needs its dependent JAR files for execution, you must specify the directory in the file system to which these JAR files are transferred so that Spark can access them:
  • Yarn mode (Yarn client or Yarn cluster):
    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.
    • When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.
