Discovering data and semantic types
When you add a dataset, Talend Data Preparation automatically suggests, for each column, the best data types or semantic types that match the data.
Discovering semantic types
To display the percentage for each semantic type, in the sample view of your dataset, click the icon.
This feature is also available from the Hierarchy view.
How is the percentage calculated?
- One percentage represents the number of values matching the semantic type; up to 100% allocated.
To determine if a value matches a semantic type, the data discovery depends on the type of the semantic type:
- Dictionary: Does the value match a value from the dictionary? Punctuation, case, spaces, and accents are ignored.
- Regular expression: Does the value match the regular expression?
- Compound: Is the value discovered in at least one child? A compound type is a group of existing semantic types, called children.
If the answer is positive, the value is considered valid.
- The other percentage represents the similarity between the column name and the name of the semantic type; up to 10% allocated. To compare the names:
- The Levenshtein algorithm is used. It calculates the minimum number of edits (insertion, deletion, or substitution) required to transform one string into another.
- Case and accents are ignored.
- If the strings contain spaces, the word order is ignored. For example, US Phone and Phone US are considered identical.
The maximum percentage is 100%. If all values match a semantic type and the column name is identical to the name of the semantic type, the result is still 100%.
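The calculation described above can be sketched in Python. This is only an illustration, not the product's implementation. The dictionary layout of a semantic type, the function names, the use of a full-string regular expression match, and the formula that turns a Levenshtein distance into a share of the 10% are all assumptions made for this example. The documentation only states that the value score is capped at 100%, the name score at 10%, and the total at 100%.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase the text and drop accents, punctuation, and spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def matches(value, sem_type):
    """Return True if the value is valid for the (hypothetical) type dict."""
    kind = sem_type["kind"]
    if kind == "dictionary":
        return normalize(value) in {normalize(v) for v in sem_type["values"]}
    if kind == "regex":
        # Assumption: the whole value must match the expression.
        return re.fullmatch(sem_type["pattern"], value) is not None
    if kind == "compound":
        return any(matches(value, child) for child in sem_type["children"])
    raise ValueError(f"unknown kind: {kind}")


def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def normalize_name(name):
    """Ignore case, accents, and word order when comparing names."""
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(sorted(plain.lower().split()))


def name_bonus(column_name, type_name):
    """Up to 10 points; the scaling by the longer name is an assumption."""
    a, b = normalize_name(column_name), normalize_name(type_name)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 10.0 * (1 - levenshtein(a, b) / longest)


def score(values, column_name, sem_type):
    """Value percentage plus name bonus, capped at 100."""
    if values:
        hits = sum(matches(v, sem_type) for v in values)
        value_pct = 100.0 * hits / len(values)
    else:
        value_pct = 0.0
    return min(100.0, value_pct + name_bonus(column_name, sem_type["name"]))


country = {"name": "Country", "kind": "dictionary",
           "values": ["France", "Côte d'Ivoire"]}
print(score(["France", "FRANCE", "Spain", "n/a"], "country", country))  # 60.0
print(score(["France"], "Country", country))                            # 100.0
```

In the first call, half of the values match (50%) and the column name is identical to the type name once case is ignored (10%), which gives 60%. In the second call the sum would be 110%, so the cap applies.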
Displaying the quality bar
Discovering native data types
| Native data type | Description | Example |
|---|---|---|
| Text | String text | username |
| Integer | Numeric value | 123 |
| Decimal | Decimal numeric value | 1.26 |
| Date | Date including day, month, and year | 11/08/2022 |
| Time | Time of the day | 11am |
| Timestamps | Date and time | 11/08 11:00 |
| Boolean | Answers with the value True or False | True |
To discover the native data type of a value, the following checks are made in order:
- Is the value empty?
- Is the value of type boolean? true and false are the only values considered of type boolean.
- Is the value of type integer?
- Is the value of type decimal?
- Is the value of type date?
- If the value is not of one of the above types, it is considered a text value.
Because the verification is incremental, each value has exactly one type. For example, the value 5 is of type integer; it is not also considered of type text.
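The incremental verification can be illustrated with a short Python sketch. It is not the product's implementation. The function name, the returned labels, the accepted date formats, the regular expressions for numbers, and the case-insensitive comparison for booleans are assumptions chosen for this example.

```python
import re
from datetime import datetime

# Assumption: only these two date layouts are recognized in this sketch.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def is_date(value):
    """Return True if the value parses with one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def detect_native_type(value):
    """Run the checks in order and stop at the first one that succeeds."""
    if value.strip() == "":
        return "empty"
    # Assumption: the comparison ignores case, so "True" also counts.
    if value.lower() in ("true", "false"):
        return "boolean"
    if re.fullmatch(r"[+-]?\d+", value):
        return "integer"
    if re.fullmatch(r"[+-]?\d*\.\d+", value):
        return "decimal"
    if is_date(value):
        return "date"
    # Nothing else matched, so the value is treated as text.
    return "text"


for sample in ["", "true", "5", "1.26", "11/08/2022", "username"]:
    print(repr(sample), "->", detect_native_type(sample))
```

Because each check returns as soon as it succeeds, a value such as 5 is reported as an integer and never reaches the text fallback.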