Shuffling data values to restrict the use of actual sensitive data
With the tDataShuffling component, you can shuffle sensitive information to replace it with other values for the same column from a different row, allowing production data to be safely used for purposes such as testing and training.
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real-Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
The Job in this scenario uses the following components:
- The tFixedFlowInput component to generate personal data including credit card numbers.
- The tDataShuffling component to shuffle original data and replace values with other values for the same column from a different row.
- The tFileOutputExcel component to output the shuffled dataset.
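The basic idea of shuffling a column can be sketched in a few lines of Python. This is only an illustration of the concept, not the algorithm that tDataShuffling implements; the shuffle_column helper, the column names and the sample rows are all hypothetical.

```python
import random

def shuffle_column(rows, column, seed=None):
    """Return a copy of rows in which the values of `column` are permuted across rows."""
    rng = random.Random(seed)              # seeded generator for repeatable runs
    values = [row[column] for row in rows]
    rng.shuffle(values)                    # permute the column values in place
    # Rebuild each row with its original fields and the reassigned column value.
    return [{**row, column: value} for row, value in zip(rows, values)]

# Hypothetical sample data, not taken from the scenario.
rows = [
    {"id": 1, "credit_card": "4000-0000-0000-0001"},
    {"id": 2, "credit_card": "4000-0000-0000-0002"},
    {"id": 3, "credit_card": "4000-0000-0000-0003"},
]
shuffled = shuffle_column(rows, "credit_card", seed=42)
```

Every original value still appears exactly once in the shuffled column, which is why shuffling alone does not hide real values. Note also that a plain random permutation such as this one can leave a value in its original row.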
Prerequisites: Further restricting the use of sensitive data
When shuffling data, it is still advised to mask sensitive data. Remember also to consider relationships between the columns when shuffling data and make sure the original dataset cannot be reconstructed.
In this scenario, last names and first names are grouped together but the email addresses are not in the same group. Consequently, the email column does not relate to the lname and fname columns. Because email addresses often contain first names and last names, the email column may help attackers reconstruct the original data.
Additionally, the address1, city and email columns are not in any group, so they were not shuffled. This means it is possible to infer, for example, that Robert Damstra lives at 1619 Stillman Court, Lynnwood.
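The effect of grouping described above can be illustrated with a minimal Python sketch. It assumes that columns in the same group are moved together with one shared permutation and that ungrouped columns stay in place; the shuffle_groups helper and the sample rows are hypothetical, and the code is not the tDataShuffling implementation.

```python
import random

def shuffle_groups(rows, groups, seed=None):
    """Shuffle each group of columns with one shared permutation per group.

    Columns that belong to no group are copied unchanged.
    """
    rng = random.Random(seed)
    result = [dict(row) for row in rows]       # start from unshuffled copies
    for group in groups:
        order = list(range(len(rows)))
        rng.shuffle(order)                     # one permutation for the whole group
        for target, source in enumerate(order):
            for column in group:
                result[target][column] = rows[source][column]
    return result

# Hypothetical sample data using the column names from this scenario.
rows = [
    {"fname": "Ann", "lname": "Lee", "email": "ann.lee@example.com",
     "address1": "1 First Street", "city": "Springfield"},
    {"fname": "Bob", "lname": "Kim", "email": "bob.kim@example.com",
     "address1": "2 Second Street", "city": "Riverton"},
    {"fname": "Cho", "lname": "Diaz", "email": "cho.diaz@example.com",
     "address1": "3 Third Street", "city": "Lakeside"},
]

# Only fname and lname are grouped; email, address1 and city are in no group.
shuffled = shuffle_groups(rows, groups=[["fname", "lname"]], seed=7)
```

In the result, each first name still travels with its last name, while the email, address1 and city values of every row are the original ones. Because the email value in each row still spells out the original name, it links that name back to the unshuffled address in the same row.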
- To avoid the use of real credit card numbers, you can mask credit card numbers using the tDataMasking component.
- To avoid the identification of customers with their email addresses, you can mask email addresses using the tDataMasking component.
- To make it more difficult to read real addresses, you can add the address1 and city columns in other groups.
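As a rough illustration of what masking adds on top of shuffling, the following Python sketch replaces characters instead of moving them between rows. The two helper functions and their parameters are hypothetical and do not reproduce the masking functions of tDataMasking; keeping the last four digits of a card number is an assumption made for the example.

```python
def mask_credit_card(number, keep_last=4, mask_char="X"):
    """Replace every digit except the last `keep_last` digits; keep separators."""
    digits_total = sum(ch.isdigit() for ch in number)
    to_mask = max(digits_total - keep_last, 0)
    masked = []
    for ch in number:
        if ch.isdigit() and to_mask > 0:
            masked.append(mask_char)
            to_mask -= 1
        else:
            masked.append(ch)                  # separators and trailing digits
    return "".join(masked)

def mask_email_local_part(email, mask_char="X"):
    """Replace the part before '@' so that it no longer reveals a name."""
    local, separator, domain = email.partition("@")
    return mask_char * len(local) + separator + domain

# Hypothetical values, not taken from the scenario.
masked_card = mask_credit_card("4000-0000-0000-0001")
masked_email = mask_email_local_part("jane.doe@example.com")
```

Unlike shuffled values, masked values no longer exist anywhere in the output, so they cannot be matched back to a customer by comparing rows.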
Setting up the Job
Procedure
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataShuffling and tFileOutputExcel.
- Connect the three components together using the Main links.