Standardizing values using dictionaries

Standardizing values in cells with autocompletion

When editing a cell in a column which semantic type is based on a dictionary or a compound type, the application will automatically suggest values that are part of this dictionary as you type, to ensure that the whole column follows the same standard.

In this example, you are working on a dataset containing customer data, including US state codes. Because all the occurrences of Texas have not been correctly entered in the two-letter code format, part of the data is considered invalid in the State column, as shown by the quality bar. You are going to isolate the rows containing errors, edit one of the cells with autocompletion, and apply the change to all identical cells to fix the whole column in a single operation.

Dataset containing customer information including US state codes.

Before you begin

The autocompletion menu is only available if the semantic type of the column is based on a dictionary of values or a compound type present by default in Talend Data Preparation, or that you have created using Talend Dictionary Service. For more information on how to create custom semantic types, or edit the existing ones, see Enriching the semantic types libraries.

Procedure

In the State column, click the orange part of the quality bar and click Select rows with invalid values for State to apply a filter and isolate the rows with Texas.
In the State column, double-click one of the occurrences of Texas.
You can now edit the content of the cell. When you delete the previous value, a drop-down list opens, suggesting all the values that are part of the US State Code dictionary-based semantic type, sorted alphabetically.
Because the code for Texas is not part of the first results that are displayed, start typing Tto see which results come up.
The suggestions will become more accurate as you type. But if the expected value can already be found from the original list, you can directly select it.
From this refined list, select the state code that corresponds to Texas, namely TX.
Select the check box Apply to all cells with this value that appears once you have finished typing or selecting a value.
The change you have made in this cell will now also apply to the other incorrect values.
Click Submit.

Results

Thanks to the suggestions, you are ensured that the new value you applied is in the correct format. All the other incorrect occurrences of Texas have been replaced by the correct TX state code in a single operation. The quality bar now indicates that all the data in the State column is valid.

Automatically standardizing values in a column

You can use the Standardize value (fuzzy matching) to find the closest valid value for invalid values within a column.

The function checks the invalid data contained in a column against the current semantic type, and retrieves the correct values, if the selected matching threshold is achieved. This function is only available if the semantic type is based on a dictionary of values or compound type present by default in Talend Data Preparation, or that you have created using Talend Dictionary Service. For more information on how to create custom semantic types, or edit the existing ones, see Enriching the semantic types libraries.

Let's say that you have to work on a dataset containing various information on customers based in the United-States, such as their names, email addresses and the State they live in.

As you can see in the header of the State column, the data has been recognized as US States, but as shown in the quality bar, some of the entries contain invalid names.

State column highlighted with invalid names.

In a single action, you will fix those invalid values, and replace them with the correct value from the US State dictionary, or semantic type, that contains an exhaustive list of all the US States.

Note: The Standardize value (fuzzy matching) function does not support Asian characters.

Procedure

Click the header of the State column in order to select its content.
In the functions panel, type Standardize values and click the result to open the options for the associated function.
In the Match threshold drop-down list, select the matching percentage that must be achieved between the incorrect value and correct value for the substitution to happen.
The three following percentage values are available:
- High: Only values that have a 90% match or more with the correct value are replaced.
- Default: Only values that have a 80% match or more with the correct value are replaced.
- None: Replaces the invalid value with the closest valid value.
The Levenshtein algorithm is used to match the data. In the case of a composed string, the matching process is actually divided in four parts:
1. A search occurs on the full string and on each tokens.
2. Dictionary values that have less than a 3-character difference to the full string or one token are returned.
3. A distance on the possible pairs is computed to return the best one.
4. The user threshold filters the results according to the distance.
Example
- Clermont Talend matches with Clermont thanks to the first token.
- Clermont-Ferra matches with Clermont-Ferrand because there are less than three different characters compared to the full string.
- Clermon-Ferant matches with Clermont because there are more than three different characters compared to the full string but only one different character with the token Clermont.
Click the Preview button to preview the result of the function, and click Submit to apply it.

Results

The incorrect values have been standardized, using the dictionary of US States.

Did this page help you?

If you find any issues with this page or its content – a typo, a missing step, or a technical error – please let us know!

Leave your feedback here

Standardizing values using dictionaries

Standardizing values in cells with autocompletion

Before you begin

Procedure

Results

Automatically standardizing values in a column

Procedure

Example

Results

Did this page help you?