Shuffling data values to restrict the use of actual sensitive data

With the tDataShuffling component, you can shuffle sensitive information to replace it with other values for the same column from a different row, allowing production data to be safely used for purposes such as testing and training.

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend MDM Platform, Talend Data Services Platform and Talend Data Fabric.

This scenario describes a Job which uses:

The tFixedFlowInput component to generate personal data including credit card numbers.
The tDataShuffling component to shuffle original data and replace values with other values for the same column from a different row.
The tFileOutputExcel component to output the shuffled dataset.

A Job using the tFixedFlowInput, tDataShuffling, and tFileOutputExcel components.

Prerequisites: Further restricting the use of sensitive data

When shuffling data, it is still advised to mask sensitive data. Remember also to consider relationships between the columns when shuffling data and make sure the original dataset cannot be reconstructed.

In this scenario, last names and first names are grouped together, but email addresses are not in the same group. Consequently, after shuffling, the email column no longer corresponds to the lname and fname columns in the same row. Since email addresses usually contain first names and last names, they may help attackers reconstruct the original data.

Additionally, the address1, city and email columns are not in any group, so they are not shuffled. This means it is possible to infer, for example, that Robert Damstra lives at 1619 Stillman Court, Lynnwood.
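The inference described above can be illustrated with a minimal Python sketch. It is not part of Talend: the function name guess_name_from_email and the sample output row are hypothetical, and the row assumes that the name columns were swapped with another row from the same country while address1, city and email stayed in place.

```python
import re


def guess_name_from_email(email):
    """Split a local part such as 'RobertDamstra' into capitalized words."""
    local_part = email.split("@", 1)[0]
    return tuple(re.findall(r"[A-Z][a-z]+", local_part))


# Hypothetical output row: the name was shuffled, the other columns were not.
shuffled_row = {
    "lname": "Gutierrez",
    "fname": "Maya",
    "address1": "1619 Stillman Court",
    "city": "Lynnwood",
    "email": "RobertDamstra@Lynnwood.org",
}

# The unshuffled email still points at the original customer.
guessed = guess_name_from_email(shuffled_row["email"])
print(guessed, "lives at", shuffled_row["address1"], shuffled_row["city"])
```

Even though the row shows a different first and last name, the email and address together still identify the original customer.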

Using this scenario, you can restrict the use of actual sensitive data even more:

To avoid the use of real credit card numbers, you can mask credit card numbers using the tDataMasking component.
To avoid the identification of customers with their email addresses, you can mask email addresses using the tDataMasking component.
To make it more difficult to read real addresses, you can add the address1 and city columns to other groups.
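To give an idea of what masking means for these two columns, here is a minimal Python sketch. It does not reproduce the tDataMasking component or any of its functions; mask_credit_card and mask_email_local_part are hypothetical helpers, and keeping the last four digits visible is an assumption made for the example.

```python
def mask_credit_card(number, visible=4, mask_char="X"):
    """Replace all but the last `visible` characters with `mask_char`."""
    if len(number) <= visible:
        return mask_char * len(number)
    return mask_char * (len(number) - visible) + number[-visible:]


def mask_email_local_part(email, mask_char="X"):
    """Mask everything before the last '@' and leave the domain readable."""
    local_part, separator, domain = email.rpartition("@")
    if not separator:
        # No '@' found: treat the whole value as sensitive.
        return mask_char * len(email)
    return mask_char * len(local_part) + separator + domain


masked_card = mask_credit_card("4244487462024688")
masked_email = mask_email_local_part("SheriNowmer@Tlaxiaco.org")
print(masked_card)
print(masked_email)
```

In the sample data the email domain is built from the city name, so leaving the domain readable still reveals the city. Masking the whole value may be preferable in that case.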

Tip: As tDataShuffling is supported on the Spark framework, you can convert this standard Job to a Spark Batch Job by editing the Job properties. This way you do not need to redefine the settings of the components in the Job.

Setting up the Job

Procedure

Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataShuffling and tFileOutputExcel.
Connect the three components together using the Main links.

Configuring the input component

Procedure

Double-click tFixedFlowInput to open its Basic settings view in the Component tab.
Create the schema through the Edit Schema button.

In the open dialog box, click the [+] button and add the columns that will hold the initial input data: customer_id, credit_card, lname, fname, mi, address1, city, state_province, postal_code, country, phone and email.
Click OK.
In the Number of rows field, enter 1.
In the Mode area, select the Use Inline Content option.

In the Content table, enter the customer data you want to shuffle, for example:

0|4244487462024688|Nowmer|Sheri|A.|2433 Bailey Road|Tlaxiaco|Oaxaca|15057|Mexico|271-555-9715|SheriNowmer@@Tlaxiaco.org
1|3458687462024688|Nowmer|Alan|A.|2433 Bailey Road|Tlaxiaco|Oaxaca|15057|Mexico|271-555-9715|AlanNowmer@Tlaxiaco.org.org
2|4639587470586299|Whelply|Derrick|I.|2219 Dewing Avenue|Sooke|BC|17172|Canada|211-555-7669|DerrickWhelply@Sooke.org
3|2541387475757600|Derry|Jeanne||7640 First Ave.|Issaquah|WA|73980|USA|656-555-2272|JeanneDerry@Issaquah.org
4|7845987500482201|Spence|Michael|J.|337 Tosca Way|Burnaby|BC|74674|Canada|929-555-7279|MichaelSpence@Burnaby.org
5|1547887514054179|Gutierrez|Maya||8668 Via Neruda|Novato|CA|57355|$$#|387-555-7172|MayaGutierrez@Novato.org
6|5469887517782449|Damstra|Robert|F.|1619 Stillman Court|Lynnwood|WA|90792|$$#|922-555-5465|RobertDamstra@Lynnwood.org
7|54896387521172800|Kanagaki|Rebecca||2860 D Mt. Hood Circle|San Andres|DF|13343|Mexico|515-555-6247|RebeccaKanagaki@Tlaxiaco.org
8|47859687539744377|Brunner|Kim|H.|6064 Brodia Court|San Andres|DF|12942|Mexico|411-555-6825|Kim@Brunner@San Andresorg
9|35698487544797658|Blumberg|Brenda|C.|7560 Trees Drive|Sooke|BC|$$|Canada|815-555-3975|BrendaBlumberg@Richmond.org
10|36521487568712234|Stanz|Darren|M.|1019 Kenwal Rd.|$$#|OR|82017|USA|847-555-5443|DarrenStanz@Lake Oswego.org
...
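To show how the inline content maps onto the schema, the following Python sketch parses two of the sample rows into dictionaries keyed by column name. It is an illustration outside Talend: parse_inline_content is a hypothetical helper, and it assumes that the field separator is the pipe character, as the sample suggests, and that the first column is named customer_id.

```python
COLUMNS = [
    "customer_id", "credit_card", "lname", "fname", "mi", "address1",
    "city", "state_province", "postal_code", "country", "phone", "email",
]


def parse_inline_content(text, field_separator="|"):
    """Turn delimited lines into a list of dicts keyed by column name."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        values = line.split(field_separator)
        if len(values) != len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} fields, "
                f"got {len(values)}"
            )
        rows.append(dict(zip(COLUMNS, values)))
    return rows


sample = """\
2|4639587470586299|Whelply|Derrick|I.|2219 Dewing Avenue|Sooke|BC|17172|Canada|211-555-7669|DerrickWhelply@Sooke.org
3|2541387475757600|Derry|Jeanne||7640 First Ave.|Issaquah|WA|73980|USA|656-555-2272|JeanneDerry@Issaquah.org"""

rows = parse_inline_content(sample)
print(rows[0]["lname"], rows[0]["country"])
```

An empty field, such as the missing middle initial in the second row, becomes an empty string instead of shifting the remaining columns.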

Configuring the tDataShuffling component

Procedure

Double-click tDataShuffling to display the Basic settings view and define the component properties.
Click Sync columns to retrieve the schema defined in the input component.
In the Shuffling columns table, click the [+] button to add four rows, and then:
- in the Column, select the columns where data will be shuffled,
- in the Group ID, select the group identifier for each column. The columns having the same group identifier are shuffled together.
In the above example, there are two groups of columns to be shuffled:
- Group ID 1: credit_card
- Group ID 2: lname, fname and mi
The Job will replace credit card numbers within the credit_card column with values from different rows. It will also keep the last name, first name and middle initial values from the lname, fname and mi columns together, and replace them as a unit with values from different rows.
Click the Advanced settings tab.

In the Partitioning columns table, click the [+] button to add one row, and select the country column.

The Job will shuffle values only among the original data rows that share the same value for the partitioning columns.

In the above example, the component is configured to apply the shuffling process to the rows sharing the same value for the country column.
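The effect of the group and partition settings can be sketched in a few lines of Python. This is not the algorithm used by tDataShuffling, only an illustration under stated assumptions: shuffle_rows is a hypothetical function, the rows are a reduced subset of the sample data, and a plain random permutation is used, which may leave some values in their original row.

```python
import random
from collections import defaultdict


def shuffle_rows(rows, groups, partition_cols=(), seed=None):
    """Return shuffled copies of `rows` (a list of dicts).

    groups: maps a group ID to the columns that must move together.
    partition_cols: values are exchanged only among rows that agree on these.
    """
    rng = random.Random(seed)
    result = [dict(row) for row in rows]

    # Bucket row indices by their partition key.
    partitions = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in partition_cols)
        partitions[key].append(index)

    for indices in partitions.values():
        for columns in groups.values():
            # One permutation per group keeps grouped columns aligned.
            sources = indices[:]
            rng.shuffle(sources)
            for target, source in zip(indices, sources):
                for col in columns:
                    result[target][col] = rows[source][col]
    return result


rows = [
    {"credit_card": "4244487462024688", "lname": "Nowmer", "fname": "Sheri", "mi": "A.", "country": "Mexico"},
    {"credit_card": "3458687462024688", "lname": "Nowmer", "fname": "Alan", "mi": "A.", "country": "Mexico"},
    {"credit_card": "4639587470586299", "lname": "Whelply", "fname": "Derrick", "mi": "I.", "country": "Canada"},
    {"credit_card": "7845987500482201", "lname": "Spence", "fname": "Michael", "mi": "J.", "country": "Canada"},
    {"credit_card": "54896387521172800", "lname": "Kanagaki", "fname": "Rebecca", "mi": "", "country": "Mexico"},
]

shuffled = shuffle_rows(
    rows,
    groups={1: ["credit_card"], 2: ["lname", "fname", "mi"]},
    partition_cols=["country"],
    seed=42,
)
for row in shuffled:
    print(row)
```

Because each group gets its own permutation, a credit card number and a name that started in the same row usually end up in different rows, while lname, fname and mi always travel together. No value ever moves to a row with a different country.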

Configuring the output component and executing the Job

Procedure

Double-click the tFileOutputExcel component to display the Basic settings view and define the component properties.
Set the destination file name as well as the sheet name and then select the Define all columns auto size check box.
Save your Job and press F6 to execute it.
The tDataShuffling component shuffles data in the selected columns and writes the result in an output file.
Right-click the output component and select Data Viewer to display the shuffled data.

tDataShuffling shuffles values within the first group of columns (credit_card) and within the second group of columns (lname, fname and mi).

The shuffling process only applies to the rows sharing the same values for the country column, as defined in the component advanced settings.

Sensitive personal information in the input data has been shuffled but data still looks real and consistent. The shuffled data is still usable for purposes other than production.
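If you export both the input and the output, the properties described above can be checked with a short script. The following Python sketch is one possible check, not a Talend feature: check_shuffle is a hypothetical helper, and it assumes that both datasets are lists of dictionaries and that the output keeps the rows in the same order as the input.

```python
from collections import Counter


def _bag(rows, partition_col, columns):
    """Count (partition value, grouped values) combinations."""
    return Counter(
        (row[partition_col],) + tuple(row[col] for col in columns)
        for row in rows
    )


def check_shuffle(original, shuffled, group_columns, partition_col):
    """Return a list of problems; an empty list means all checks passed."""
    if len(original) != len(shuffled):
        return ["row count changed"]

    problems = []
    grouped = {col for columns in group_columns for col in columns}

    # Columns outside every group must be untouched.
    for before, after in zip(original, shuffled):
        for col in before:
            if col not in grouped and before[col] != after[col]:
                problems.append(f"column {col!r} changed but is in no group")

    # Grouped values must stay together and inside their partition.
    for columns in group_columns:
        if _bag(original, partition_col, columns) != _bag(shuffled, partition_col, columns):
            problems.append(f"group {columns} broke apart or crossed a partition")
    return problems


original = [
    {"customer_id": "0", "lname": "Nowmer", "fname": "Sheri", "country": "Mexico"},
    {"customer_id": "7", "lname": "Kanagaki", "fname": "Rebecca", "country": "Mexico"},
    {"customer_id": "2", "lname": "Whelply", "fname": "Derrick", "country": "Canada"},
]
# Names swapped between the two rows from Mexico: a valid result.
valid = [
    {"customer_id": "0", "lname": "Kanagaki", "fname": "Rebecca", "country": "Mexico"},
    {"customer_id": "7", "lname": "Nowmer", "fname": "Sheri", "country": "Mexico"},
    {"customer_id": "2", "lname": "Whelply", "fname": "Derrick", "country": "Canada"},
]
# Names swapped between Mexico and Canada: violates the partitioning.
invalid = [
    {"customer_id": "0", "lname": "Whelply", "fname": "Derrick", "country": "Mexico"},
    {"customer_id": "7", "lname": "Kanagaki", "fname": "Rebecca", "country": "Mexico"},
    {"customer_id": "2", "lname": "Nowmer", "fname": "Sheri", "country": "Canada"},
]

print(check_shuffle(original, valid, [["lname", "fname"]], "country"))
print(check_shuffle(original, invalid, [["lname", "fname"]], "country"))
```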
