Data generation functions

You can generate output data different from the input data.

| Function | Random generation | Consistent generation | Bijective generation | Input data validation |
| --- | --- | --- | --- | --- |
| Generate from pattern | Yes | No | No | No |
| Generate UUID | Yes | No | No | No |
| Generate sequence | Yes | No | No | No |
| Generate from file/list | Yes | Yes | No | No |

Generate from pattern

This function generates a value based on a user-defined pattern. This function applies only to Strings.

Extra parameter: This function requires an extra parameter.

The extra parameter is a pattern that follows these rules:

  • A is replaced with a random Latin uppercase letter.
  • a is replaced with a random Latin lowercase letter.
  • 9 is replaced with a random digit.
  • H is replaced with a random Hiragana character.
  • K is replaced with a random full-width Katakana character.
  • k is replaced with a random half-width Katakana character.
  • C is replaced with a random Kanji character.
  • G is replaced with a random Hangul character.

All other characters are copied to the generated value as is.

For more information about the supported character types and the related Unicode ranges, see Data masking functions in the masking components.

You can also use numbered backreferences (\\<number>) using the following syntax: <pattern>\\<number>,<group1>,<groupN>.

  • <pattern> corresponds to the pattern to be used for generating the output value.
  • \\<number> is a numbered backreference. <number> identifies the position of the group placed after the "," character.
  • <group1>,<groupN> are comma-separated groups of characters. Each group is treated as a single unit. If a backreference calls a group, it is added as is in the generated value.

If you want to copy a character used in patterns (A, a, 9, H, h, K, k, C, G) as is in the generated value, use a backreference.

This function does not work correctly if a comma ',' is used in the pattern.

In the following example:
  • a characters are replaced with random Latin lowercase letters.
  • s characters are copied as is to the generated output.
  • \\2 calls the group placed after the second "," character, which is @talend.com.

| Input value | Extra parameter | Example of a masked value |
| --- | --- | --- |
| A26 | "aaaass\\2,@gmail.com,@talend.com" | hjdfss@talend.com |

In the following example:
  • \\3 calls the group placed after the third "," character, which is a.
  • 9 characters are replaced with random digits.

| Input value | Extra parameter | Example of a masked value |
| --- | --- | --- |
| A26 | "\\39999,D,Z,a" | a4825 |
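
The pattern rules and backreference syntax described above can be illustrated with a minimal Python sketch. It is not the component's implementation: the names generate_from_pattern, CLASSES, and TOKEN are hypothetical, only the A, a, and 9 codes are handled (the Hiragana, Katakana, Kanji, and Hangul codes are left out for brevity and would be copied as is here), and a backreference is assumed to be a single digit from 1 to 9.

```python
import random
import re
import string

# Character classes for the Latin and digit pattern codes only (assumption:
# the CJK codes H, K, k, C, G are omitted from this sketch).
CLASSES = {
    "A": string.ascii_uppercase,
    "a": string.ascii_lowercase,
    "9": string.digits,
}

# Either a backreference (one or more backslashes, then one digit 1-9)
# or any other single character.
TOKEN = re.compile(r"\\+([1-9])|(.)", re.DOTALL)


def generate_from_pattern(extra_parameter, rng=None):
    """Build a value from '<pattern>,<group1>,...,<groupN>' (hypothetical helper)."""
    rng = rng or random.Random()
    # Everything before the first comma is the pattern; the rest are groups.
    pattern, *groups = extra_parameter.split(",")
    out = []
    for match in TOKEN.finditer(pattern):
        if match.group(1) is not None:
            # Backreference: copy the referenced group as is.
            out.append(groups[int(match.group(1)) - 1])
        else:
            ch = match.group(2)
            # Pattern code: draw a random character; otherwise copy as is.
            out.append(rng.choice(CLASSES[ch]) if ch in CLASSES else ch)
    return "".join(out)


first = generate_from_pattern("aaaass\\2,@gmail.com,@talend.com")
second = generate_from_pattern("\\39999,D,Z,a")
print(first)   # four random lowercase letters, then ss@talend.com
print(second)  # the letter a, then four random digits
```

Because the parameter is split on every comma, a comma inside the pattern itself would be read as a group separator, which matches the limitation noted above.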

Generate UUID

This function masks the input value with a randomly generated universally unique identifier (UUID).

This function uses the UUID.randomUUID() method provided by Java. This Java method does not use a seed, meaning that if you run the job twice, the function generates different UUIDs.

This function applies to Strings.

This function requires no extra parameter.

In the following example, the masked value is a randomly generated UUID.

| Input value | Example of a masked value |
| --- | --- |
| A26 | 28e92000-aafa-4ec3-bd56-240f192a4a8c |
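
The unseeded behavior described above can be illustrated with a short Python sketch. It uses Python's uuid.uuid4(), which, like Java's UUID.randomUUID(), returns a random (version 4) UUID. This is an analogy only, not the component's code, and the generate_uuid name is hypothetical.

```python
import uuid


def generate_uuid(_input_value):
    """Ignore the input value and return a random version 4 UUID as a string."""
    return str(uuid.uuid4())


first = generate_uuid("A26")
second = generate_uuid("A26")
# The same input gives two different UUIDs, because no seed is involved.
print(first)
print(second)
```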

Generate sequence

This function returns the value of the extra parameter for the first row, then increases this number by 1 for each subsequent row.

This function can be applied to all data types except Date (Integer, Long, String, and so on).

Note: This function is not supported in the Spark version of the component.

Extra parameter: This function requires an extra parameter.

The extra parameter must be a number.

If the extra parameter is not a number, it is set to 0.

In the following example, the generated sequence starts with the number set as an extra parameter.

| Input values | Extra parameter | Examples of masked values |
| --- | --- | --- |
| 21 | "0" | 0 |
| A48 | "0" | 1 |
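
The sequence behavior described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the component's code: the generate_sequence name is hypothetical, and "a number" is taken to mean an integer.

```python
from itertools import count


def generate_sequence(extra_parameter):
    """Return an iterator that starts at the extra parameter and adds 1 per row."""
    try:
        start = int(extra_parameter)
    except (TypeError, ValueError):
        # A non-numeric extra parameter falls back to 0.
        start = 0
    return count(start)


sequence = generate_sequence("0")
# One generated value per input row; the input values themselves are ignored.
masked = [next(sequence) for _ in ["21", "A48"]]
print(masked)  # [0, 1]
```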

Generate from file/list

This function randomly replaces the input value with one of the user-defined values.

This function is applied to Strings or numerical data types.

Method: The Randomly method randomly selects a value from the list (or file). As a result, two similar input values can be masked with different output values.

The Consistently method ensures that two similar input values are masked with the same output value.

When using the Consistently method, the probability of generating duplicates can be calculated using the following formulas:
  • P = 1 if K < N, or
  • P = 1 - (K*(K-1)*(K-2)*…*(K-N+1)) / K^N otherwise,

where P is the probability of generating duplicates, N is the input data size, and K is the size of the input list given as a parameter.

Using this approach, it is possible to calculate the probability of finding a pair sharing the same value within a group.

For example, the probability that, in a group of n people, two people have the same birthday is the following:
  • 2.7% in a group of 5 people,
  • 41.1% in a group of 20 people,
  • 100% in a group of 367 people, since there are 366 possible birthdays, including February 29.
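
The duplicate-probability formula and the birthday figures above can be checked numerically. The following Python sketch is only an illustration; the duplicate_probability name is hypothetical, n stands for N (input data size), and k stands for K (list size). It uses 366 possible birthdays, as in the example above.

```python
from math import prod  # available in Python 3.8 and later


def duplicate_probability(n, k):
    """Probability that n values drawn from k possibilities contain a duplicate."""
    if k < n:
        # More inputs than possible outputs: a duplicate is certain.
        return 1.0
    # 1 - K*(K-1)*...*(K-N+1) / K^N, computed factor by factor to avoid
    # very large intermediate numbers.
    return 1.0 - prod((k - i) / k for i in range(n))


for group_size in (5, 20, 367):
    p = duplicate_probability(group_size, 366)
    print(f"{group_size} people: {p:.1%}")
```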
Extra parameter: This function requires an extra parameter.
The extra parameter can be:
  • a comma-separated list of two values minimum; or
  • a path to a file containing the values.

The values must be stored in a String and separated by commas, for example: "item1, item2, item3, etc.". This function uses the hashCode() method provided by Java to choose an element from the list.

If you use the Apache Spark version of the component, set the file path as follows:
  • In local mode:
    • Apache Spark 3.1 and earlier: prefix://file path or file:///file path.
    • Apache Spark 3.2 and later: file:///file path.
  • In Standalone and Yarn modes: prefix://file path.
  • If the index is on a cluster: hdfs://hdpnameservice1/file path.

Paths to folders are not supported.

If the extra parameter is not set, the function returns an empty String or 0.

In the following example, the masked value is one of the values set as extra parameters.

| Input value | Method | Extra parameter | Example of a masked value |
| --- | --- | --- | --- |
| 21 | Randomly | "help,documentation" | help |
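
The difference between the Randomly and Consistently methods can be sketched in Python as follows. This is a rough illustration under several assumptions, not the component's code. The function names are hypothetical. The hash is a 31-based string hash written in the style of Java's String.hashCode(). Using the hash modulo the list size to pick an element is a guess at how the hash might be applied. Values are assumed to be trimmed of surrounding spaces. File paths are not handled.

```python
import random


def parse_values(extra_parameter):
    """Split a comma-separated list, dropping empty items (assumed trimming)."""
    return [v.strip() for v in extra_parameter.split(",") if v.strip()]


def java_style_hash(text):
    """31-based string hash in the style of Java's String.hashCode(), kept to 32 bits."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def generate_from_list(input_value, extra_parameter, method="Randomly", rng=None):
    values = parse_values(extra_parameter)
    if not values:
        # No extra parameter: return an empty String.
        return ""
    if method == "Consistently":
        # Assumption: the hash of the input picks the index, so equal inputs
        # always map to the same output.
        return values[java_style_hash(str(input_value)) % len(values)]
    rng = rng or random.Random()
    return rng.choice(values)


print(generate_from_list(21, "help,documentation"))
print(generate_from_list(21, "help,documentation", method="Consistently"))
```

With the Consistently method, the result depends only on the input value and the list, so repeated calls with the same arguments return the same element.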
