Generating duplicate data from an input flow
This scenario describes a basic Job that generates a sample of duplicate data from an input flow, using probability distributions and specific criteria on three columns: Name, City and DOB (date of birth).
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform and Talend Data Fabric.
This scenario uses:
- tFileInputDelimited as the input component.
- tDuplicateRow to generate duplicate data from the input flow.
- tFileOutputDelimited to output the data in a delimited file.
Below is a capture of sample data from the input flow:
Setting up the Job
Procedure
- Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tDuplicateRow and tFileOutputDelimited.
- Connect all the components together using the Row > Main link.
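The three-component flow above (read a delimited file, duplicate rows, write a delimited file) can be pictured with a minimal Python sketch. This is only an illustration of the data flow, not how Talend implements the Job: the function names, the sample records, the semicolon delimiter and the duplicate_fraction and group_size parameters are all hypothetical, and the duplicates here are exact copies with a fixed group size rather than sizes drawn from a distribution.

```python
import csv
import io
import random

# Hypothetical sample input with the three columns used in this scenario.
SAMPLE = (
    "Name;City;DOB\n"
    "Alice Martin;Paris;1980-02-14\n"
    "Bob Jones;Boston;1975-11-30\n"
    "Carla Diaz;Madrid;1990-06-01\n"
)
FIELDS = ["Name", "City", "DOB"]


def read_delimited(text, delimiter=";"):
    """Stand-in for the input component: parse delimited text into row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def duplicate_rows(rows, duplicate_fraction, group_size, rng):
    """Stand-in for the duplication step.

    Each row is kept once; with probability duplicate_fraction it is also
    copied so that its group holds group_size records in total.
    """
    out = []
    for row in rows:
        out.append(dict(row))
        if rng.random() < duplicate_fraction:
            out.extend(dict(row) for _ in range(group_size - 1))
    return out


def write_delimited(rows, delimiter=";"):
    """Stand-in for the output component: serialize rows as delimited text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = read_delimited(SAMPLE)
result = duplicate_rows(rows, duplicate_fraction=1.0, group_size=4,
                        rng=random.Random(0))
output_text = write_delimited(result)
print(output_text)
```

With duplicate_fraction set to 1.0, every input record ends up in a group of four identical records; lower values leave some records without duplicates.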
Configuring the input data
Procedure
Configuring the duplicate data
Procedure
Configuring the output component
Procedure
Executing the Job
Procedure
Showing chart results of each of the probability distributions
The best way to see how duplicates are generated according to each of the three probability distributions is to create a match analysis on each of the results and compare the charts.
Procedure
Results
- Bernoulli distribution: The curve is symmetrical. The groups of duplicates are distributed evenly on each side of an average value, 4 in this example. This value is the average number of duplicates in a group of duplicates, and it is the number you set in the Average group size field in the basic settings of the tDuplicateRow component.
- Poisson distribution: The curve is not symmetrical. The groups of duplicates are distributed unevenly.
- Geometric distribution: The form of the curve is decided by the percentage you set for the duplicated records in the tDuplicateRow basic settings. The higher the percentage is, the fewer groups with many records you will have. In this example, the percentage for the duplicate records is set to 80%. This is why many groups with two-record duplicates are generated (148 groups), while there is only one group each with 14, 15 and 16 duplicates.
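To get a feel for the three curve shapes described above without running a match analysis, the following Python sketch draws group sizes from a sum of Bernoulli trials, a Poisson distribution and a geometric distribution, and prints a text histogram for each. It is an assumption-laden illustration: the sampling functions, the parameter values (8 trials at probability 0.5, a Poisson mean of 4, a geometric success probability of 0.5) and the sample count are chosen for the example and are not taken from the tDuplicateRow implementation, so the counts will not match the charts in the Studio.

```python
import math
import random
from collections import Counter


def bernoulli_sum(rng, trials, p):
    """Sum of independent Bernoulli trials; symmetric around trials * p when p = 0.5."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def poisson(rng, lam):
    """Poisson sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def geometric(rng, p):
    """Number of trials up to and including the first success (minimum 1)."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def histogram(samples):
    """Map each observed group size to the number of groups of that size."""
    return dict(sorted(Counter(samples).items()))


def show(title, hist, scale=100):
    """Print one bar per group size; each '#' stands for `scale` groups."""
    print(title)
    for size, count in hist.items():
        print(f"  {size:>2} | {'#' * (count // scale)} {count}")


rng = random.Random(42)  # fixed seed so repeated runs give the same output
n = 10_000               # number of simulated groups (illustrative)

bern = [bernoulli_sum(rng, trials=8, p=0.5) for _ in range(n)]
pois = [poisson(rng, lam=4.0) for _ in range(n)]
geom = [geometric(rng, p=0.5) for _ in range(n)]

show("Sum of Bernoulli trials (mean 4)", histogram(bern))
show("Poisson (mean 4)", histogram(pois))
show("Geometric (p = 0.5)", histogram(geom))
```

In this sketch the first histogram is roughly mirror-symmetric around 4, the Poisson histogram has a longer tail toward large sizes, and the geometric histogram is dominated by the smallest size with a rapidly thinning tail, which matches the qualitative descriptions in the results above.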