Discovering data and semantic types

When you add a dataset, Talend Data Preparation automatically suggests, for each column, the best data types or semantic types that match the data.

Discovering semantic types

For each semantic type, the data discovery calculates the percentage of values in the column that match it. If the result is greater than 40%, the semantic type is assigned to the column.
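The threshold rule can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name `assign_semantic_type` is hypothetical, and picking the highest score when several types exceed 40% is an assumption, since the text only states the 40% rule.

```python
from typing import Optional

THRESHOLD = 40.0  # percent, as stated in the text


def assign_semantic_type(scores: dict[str, float]) -> Optional[str]:
    """Return the semantic type to assign, or None if no score exceeds 40%.

    `scores` maps a semantic type name to its discovery percentage.
    Choosing the highest-scoring candidate is an assumption.
    """
    candidates = {name: pct for name, pct in scores.items() if pct > THRESHOLD}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(assign_semantic_type({"Address Line": 72.0, "City": 41.0}))  # Address Line
print(assign_semantic_type({"City": 40.0}))  # None: 40% is not greater than 40%
```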

To display the percentage for each semantic type, in the sample view of your dataset, click the menu icon.

Semantic type displayed for Address Line.

This feature is also available from the Hierarchy view.

How is the percentage calculated?

This percentage is the sum of two percentages:
  • One percentage is the share of values that match the semantic type. It contributes up to 100%.

    To determine whether a value matches a semantic type, the data discovery applies a test that depends on the kind of semantic type:

    • Dictionary: Does the value match a value from the dictionary? Punctuation, case, spaces, and accents are ignored.
    • Regular expression: Does the value match the regular expression?
    • Compound: Does the value match at least one child?
      A compound type is a group of existing semantic types, called children.

    If the answer is positive, the value is considered valid.

  • The other percentage represents the similarity between the column name and the name of the semantic type. It contributes up to 10%.
    To compare the names:
    • The Levenshtein algorithm is used. It calculates the minimum number of edits (insertion, deletion, or substitution) required to transform one string into another.
    • The case and accents are ignored.
    • If the strings contain spaces, the word order is ignored. For example, US Phone and Phone US are considered identical.
    The total is capped at 100%. If all values match a semantic type and the column name is identical to the name of the semantic type, the result is still 100%.
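The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's code: all function names are hypothetical, the linear mapping from Levenshtein distance to the 10% name bonus is an assumption (the text does not say how the distance is scaled), and so is counting every value in the denominator.

```python
import unicodedata


def dictionary_key(value: str) -> str:
    """Normalize a value for dictionary lookup: drop accents, case, spaces, punctuation."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch.lower() for ch in decomposed if ch.isalnum())


def normalize(name: str) -> str:
    """Lowercase, strip accents, and sort the words so that word order is ignored."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(sorted(no_accents.lower().split()))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_bonus(column_name: str, type_name: str, max_bonus: float = 10.0) -> float:
    """Up to 10 points for name similarity (linear scaling is an assumption)."""
    a, b = normalize(column_name), normalize(type_name)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return max_bonus * (1.0 - levenshtein(a, b) / longest)


def discovery_score(values, matches, column_name: str, type_name: str) -> float:
    """Share of matching values plus the name bonus, capped at 100%."""
    if not values:
        return 0.0
    matched = sum(1 for v in values if matches(v))
    value_pct = 100.0 * matched / len(values)
    return min(100.0, value_pct + name_bonus(column_name, type_name))


cities = {dictionary_key(c) for c in ["São Paulo", "Paris"]}
column = ["sao-paulo", "PARIS", "unknown", "n/a"]
print(discovery_score(column, lambda v: dictionary_key(v) in cities, "city", "City"))  # 60.0
```

In this example, two of the four values match the dictionary (50%) and the column name is identical to the type name once case is ignored (10%), so the result is 60%.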

Displaying the quality bar

The quality bar shows how many values are invalid, empty and valid according to the assigned semantic type. To display it, activate the Use for validation setting in the configuration of the semantic type.
  • From the Grid view:
    Quality bar shown from the Grid view.
  • From the Hierarchy view:
    Quality bar shown from the Hierarchy view.

The percentage of valid values shown in the quality bar might be lower than the percentage calculated by the data discovery. This happens when:

  • The validation rule is more restrictive than the semantic type. In this case, values that match the semantic type during discovery can still fail the validation rule, for example because of differences in case or punctuation.
  • The similarity between the column name and the name of the semantic type raises the result of the semantic type to 100%. In this case, the quality bar shows between 90% and 100% of valid values.
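To illustrate how the quality bar can differ from the discovery result, here is a small sketch that counts valid, empty, and invalid values against a validation rule. The function name `quality_bar` is hypothetical, treating `None` and whitespace-only strings as empty is an assumption, and the case-sensitive rule is just one example of a rule that is stricter than discovery.

```python
def quality_bar(values, is_valid):
    """Count valid, empty, and invalid values for a column.

    `is_valid` stands in for the validation rule of the assigned semantic type.
    """
    counts = {"valid": 0, "empty": 0, "invalid": 0}
    for value in values:
        if value is None or str(value).strip() == "":
            counts["empty"] += 1
        elif is_valid(value):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    return counts


# A case-sensitive rule rejects "paris", even though a discovery that ignores
# case would have counted it as a match.
strict_cities = {"Paris", "Lyon"}
print(quality_bar(["Paris", "paris", "", None, "Lyon"], lambda v: v in strict_cities))
# {'valid': 2, 'empty': 2, 'invalid': 1}
```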

Discovering native data types

When no semantic type obtains a result greater than 40%, the data discovery assigns a native data type instead.
List of the different native data types

| Native data type | Description | Example |
|---|---|---|
| Text | String text | username |
| Integer | Numeric value | 123 |
| Decimal | Decimal numeric value | 1.26 |
| Date | Date including day, month and year | 11/08/2022 |
| Time | Time of the day | 11am |
| Timestamps | Date and time | 11/08 11:00 |
| Boolean | Answers with the value True or False | True |

To determine the type of a value, the data discovery runs the following checks in order:
  • Is the value empty?
  • Is the value of type boolean? true and false are the only values considered of type boolean.
  • Is the value of type integer?
  • Is the value of type decimal?
  • Is the value of type date?
  • If the value is not of one of the above types, it is considered a text value.

    Because the checks are applied in sequence, each value has exactly one type. For example, the value 5 is of type integer and is not also considered a text value.
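The ordered checks above can be sketched as a chain of tests that stops at the first match. This is an illustration only: the function name `detect_native_type` is hypothetical, and the accepted number patterns, the date formats, and the case-insensitive handling of true and false are all assumptions, since the text does not specify them.

```python
import re
from datetime import datetime

# Assumed date formats, for illustration only.
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d")


def is_date(text: str) -> bool:
    """Return True if the text parses with one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def detect_native_type(value: str) -> str:
    """Apply the checks in the documented order and return the first type that fits."""
    text = value.strip()
    if text == "":
        return "empty"
    if text.lower() in ("true", "false"):  # case-insensitive match is an assumption
        return "boolean"
    if re.fullmatch(r"[+-]?\d+", text):
        return "integer"
    if re.fullmatch(r"[+-]?\d*\.\d+", text):
        return "decimal"
    if is_date(text):
        return "date"
    return "text"  # fallback when no earlier check matched


for sample in ["", "true", "5", "1.26", "11/08/2022", "username"]:
    print(repr(sample), "->", detect_native_type(sample))
```

Because each check returns as soon as it succeeds, the value 5 is reported as an integer and never reaches the text fallback.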
