Setting up the Job
Procedure
- Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tMatchPredict and tFileOutputDelimited.
- Connect tFileInputDelimited to tMatchPredict using the Main link.
- Connect tMatchPredict to tFileOutputDelimited using the Suspect duplicates link.
- In the Run > Spark Configuration view, check that you have defined the connection to the Spark cluster and activated checkpointing, as described in Computing suspect pairs and suspect sample from source data.
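The Job is built graphically in the Studio, so no code is needed for the steps above. As a reading aid only, the layout they produce can be sketched as a small graph in Python. The Job class, its methods, and the two boolean flags are hypothetical names made up for this illustration and are not part of any Talend API; only the component names and link names come from the procedure.

```python
from dataclasses import dataclass, field


# Hypothetical model of the Job layout. These names are illustrative only
# and do not correspond to any Talend API.
@dataclass
class Job:
    components: list = field(default_factory=list)
    links: list = field(default_factory=list)  # (source, link name, target)
    spark_connection_defined: bool = False
    checkpointing_enabled: bool = False

    def add(self, *names):
        """Place components on the (modelled) design workspace."""
        self.components.extend(names)

    def connect(self, source, link_name, target):
        """Record a link between two components already on the workspace."""
        for name in (source, target):
            if name not in self.components:
                raise ValueError(f"{name} is not on the workspace")
        self.links.append((source, link_name, target))

    def problems(self):
        """List the Spark configuration prerequisites that are still missing."""
        issues = []
        if not self.spark_connection_defined:
            issues.append("Spark cluster connection is not defined")
        if not self.checkpointing_enabled:
            issues.append("checkpointing is not activated")
        return issues


def build_job():
    job = Job()
    # Step 1: drop the three components onto the workspace.
    job.add("tFileInputDelimited", "tMatchPredict", "tFileOutputDelimited")
    # Step 2: input feeds tMatchPredict through the Main link.
    job.connect("tFileInputDelimited", "Main", "tMatchPredict")
    # Step 3: tMatchPredict feeds the output through the Suspect duplicates link.
    job.connect("tMatchPredict", "Suspect duplicates", "tFileOutputDelimited")
    # Step 4: the settings checked in Run > Spark Configuration.
    job.spark_connection_defined = True
    job.checkpointing_enabled = True
    return job


job = build_job()
for source, link_name, target in job.links:
    print(f"{source} --[{link_name}]--> {target}")
print("Outstanding problems:", job.problems())
```

The sketch makes the data flow explicit: one input, one processing component, and one output that receives only the rows sent over the Suspect duplicates link.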
