Configuring key generation
-
Double-click tGenKey to display the Basic settings view and define the component properties.
To reuse blocking keys from match rules that were created with the VSR algorithm and tested in the Profiling perspective of Talend Studio, click the import button and import them into your Job. Otherwise, define the blocking key parameters as described in the steps below.
-
Under the Algorithm table, click the plus button to add a row to the table.
-
In the Column column, click the newly added row and select from the list the column you want to process with an algorithm. In this example, select PostalCode.
Note: When you select a date column on which to apply an algorithm or a matching algorithm, you can decide what to compare in the date format. For example, to compare only the year of the date, set the type of the date column to Date in the component schema and enter "yyyy" in the Date Pattern field. The component then converts the date to a string according to the pattern defined in the schema before starting a string comparison.
-
Select the Show help check box to display instructions on how to set the algorithm and option parameters.
-
In the Algorithm column, click the newly added row and select from the list the algorithm you want to apply to the corresponding column. In this example, select first N characters of the string.
-
Click in the Value column and enter a value for the selected algorithm when one is needed. In this scenario, enter 2.
In this example, we want to generate a functional key that holds the first two characters of the postal code for each data row, and we do not want to define any extra options on these columns.
Make sure to set a value for every algorithm that needs one; otherwise, you may get a compilation error when you run the Job.
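The effect of these settings can be illustrated outside Talend Studio with a minimal Python sketch. It is not the tGenKey implementation: the function names, the `functional_key` field and the sample rows are hypothetical, and returning an empty key for a missing value is an assumption. The `year_key` helper mirrors the date note above, with Python's `%Y` playing the role of the "yyyy" Date Pattern.

```python
from datetime import date


def first_n_chars(value, n):
    """Return the first n characters of value as a string.

    Assumption: a missing value (None) yields an empty key; the source
    does not say how the component treats missing values.
    """
    if value is None:
        return ""
    return str(value)[:n]


def year_key(d):
    """Convert a date to a year-only string before any string comparison.

    "%Y" is the Python counterpart of the "yyyy" pattern in the note above.
    """
    return d.strftime("%Y")


# Hypothetical sample rows; only PostalCode is taken from the example above.
rows = [
    {"Name": "Alice", "PostalCode": "75011"},
    {"Name": "Bob", "PostalCode": "75002"},
    {"Name": "Chloe", "PostalCode": "69002"},
]

# Equivalent of choosing "first N characters of the string" with Value = 2.
for row in rows:
    row["functional_key"] = first_n_chars(row["PostalCode"], 2)

keys = [row["functional_key"] for row in rows]
```

In this sketch, the first two rows share the key "75" and would fall into the same block, while the third row gets the key "69".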
Once you have defined the tGenKey properties, you can display a statistical view of these parameters. To do so:
-
Right-click the tGenKey component and select View Key Profile in the contextual menu.
The View Key Profile editor opens. It lets you view statistics on the number of rows per block and adjust the key parameters according to the results you want to get.
When you process a large amount of data and use this component to partition the data for a matching component (such as tRecordMatching or tMatchGroup), it is preferable to limit the number of rows per block. About 50 rows per block is considered optimal, but the right number depends on the number of fields to compare, the total number of rows, and the processing time considered acceptable.
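To see why block size matters, the rows-per-block statistic can be approximated with a short Python sketch. This is an illustration only, not what the View Key Profile editor runs: the function names and sample keys are hypothetical, and the comparison count assumes that a matching step compares every pair of rows within a block.

```python
from collections import Counter


def block_size_summary(keys):
    """Summarize how many rows share each functional key (one key = one block)."""
    counts = Counter(keys)
    sizes = list(counts.values())
    return {
        "blocks": len(sizes),
        "largest": max(sizes),
        "average": sum(sizes) / len(sizes),
    }


def pairwise_comparisons(keys):
    """Count row pairs compared if every pair inside a block is compared.

    Assumption: all-pairs comparison within a block, that is n * (n - 1) / 2
    comparisons for a block of n rows.
    """
    return sum(n * (n - 1) // 2 for n in Counter(keys).values())


# Hypothetical functional keys, for example two-character postal code prefixes.
sample_keys = ["75", "75", "75", "69", "69", "13"]

summary = block_size_summary(sample_keys)
blocked = pairwise_comparisons(sample_keys)
# A single key for all rows models the case where no blocking is applied.
unblocked = pairwise_comparisons(["all"] * len(sample_keys))
```

With these six sample rows, blocking reduces the work from 15 pairwise comparisons to 4. Under the all-pairs assumption the count grows quadratically with block size, which is why a few oversized blocks can dominate processing time.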
For an example of how to use the View Key Profile option, see Comparing columns and grouping in the output flow duplicate records that have the same functional key.