Word-based patterns

Talend Data Preparation allows you to analyze how word-based patterns are distributed in your data.

The word-based pattern indicators are case-sensitive. The following table describes what each pattern shown in the profiling area corresponds to:

Pattern	Description
[Word]	Word starting with an uppercase character followed by lowercase characters
[WORD]	Word consisting only of uppercase characters
[word]	Word consisting only of lowercase characters
[Char]	Single uppercase character
[char]	Single lowercase character
[Ideogram]	One of the CJK Unified Ideographs
[IdeogramSeq]	Sequence of ideograms
[hiraSeq]	Sequence of Japanese Hiragana characters
[kataSeq]	Sequence of Japanese Katakana characters
[hangulSeq]	Sequence of Korean Hangul characters
[digit]	One of the Arabic numerals: 0,1,2,3,4,5,6,7,8,9
[number]	Sequence of digits

The following examples illustrate how certain records would be interpreted in Talend Data Preparation.

String	Pattern
A character is NOT a Word	[Char] [word] [word] [WORD] [char] [Word]
someWordsINwORDS	[word][Word][WORD][char][WORD]
Example123@domain.com	[Word][number]@[word].[word]
anotherExample8@domain.com	[word][Word][digit]@[word].[word]
袁 花木蘭88	[Ideogram] [IdeogramSeq][number]
Latin2中文	[Word][digit][IdeogramSeq]
Latin3フランス	[Word][digit][kataSeq]
Latin4とうきょう	[Word][digit][hiraSeq]
Latin5나는 한국 사람입니다	[Word][digit][hangulSeq]
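
The mapping shown in the tables above can be approximated with a small regular-expression tokenizer. The following Python code is a minimal sketch, not the implementation used by Talend Data Preparation: the rule order, the Unicode ranges, and the names `_RULES` and `word_pattern` are assumptions chosen so that the Latin, ideogram, Hiragana, and Katakana examples are reproduced.

```python
import re

# Ordered rules: when several could match at the same position, the first wins.
# Assumption: order and Unicode ranges are illustrative, not taken from Talend.
_RULES = [
    ("[Word]", r"[A-Z][a-z]+"),               # uppercase followed by lowercase
    ("[WORD]", r"[A-Z]{2,}"),                 # two or more uppercase characters
    ("[word]", r"[a-z]{2,}"),                 # two or more lowercase characters
    ("[Char]", r"[A-Z]"),                     # single uppercase character
    ("[char]", r"[a-z]"),                     # single lowercase character
    ("[IdeogramSeq]", r"[\u4e00-\u9fff]{2,}"),  # CJK Unified Ideographs (basic block)
    ("[Ideogram]", r"[\u4e00-\u9fff]"),
    ("[hiraSeq]", r"[\u3041-\u3096]+"),       # Hiragana
    ("[kataSeq]", r"[\u30a1-\u30fa\u30fc]+"),  # Katakana, including the long-vowel mark
    ("[hangulSeq]", r"[\uac00-\ud7a3]+"),     # Hangul syllables
    ("[number]", r"[0-9]{2,}"),               # sequence of digits
    ("[digit]", r"[0-9]"),                    # single digit
]

# One named group per rule, so the matching rule can be recovered from the match.
_TOKEN = re.compile(
    "|".join(f"(?P<g{i}>{rx})" for i, (_, rx) in enumerate(_RULES))
)


def word_pattern(text: str) -> str:
    """Replace each recognized run of characters with its pattern label.

    Characters that match no rule (spaces, '@', '.', and so on) are kept as-is.
    """
    def label(match: re.Match) -> str:
        # lastgroup is the name of the rule that matched, for example "g3".
        return _RULES[int(match.lastgroup[1:])][0]

    return _TOKEN.sub(label, text)


print(word_pattern("someWordsINwORDS"))       # [word][Word][WORD][char][WORD]
print(word_pattern("Example123@domain.com"))  # [Word][number]@[word].[word]
```

Because the sketch keeps whitespace verbatim, a phrase made of several space-separated words produces one label per word; it does not try to reproduce how the tool handles spaces inside Hangul text.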
