tChunking Standard Properties
These properties are used to configure tChunking running in the Standard Job framework.
The Standard tChunking component belongs to the AI family.
Basic settings
Schema and Edit Schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields. Click Sync columns to retrieve the schema from the previous component connected in the Job. Click Edit schema to make changes to the schema. Two options are available:
Built-In: You create and store the schema locally for this component only.
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Split column | Select the column that must be broken down into chunks. |
Chunking method | Select the method from the drop-down list:
Fixed-size: the text is split into chunks measured in characters.
Token-based: the text is split into chunks measured in tokens.
The chunks will be the exact defined size or smaller. |
Chunk size | Enter the maximum number of characters or tokens that a chunk can contain. When the Chunking method is Fixed-size, the size is counted in characters; when it is Token-based, the size is counted in tokens. |
Chunk overlap | Enter the number of characters or tokens that should overlap between two adjacent chunks. This number must be less than the Chunk size. Overlapping ensures continuity and context between chunks, and prevents the segmentation from disrupting the flow and coherence of the text. Whether the overlap is counted in characters or tokens depends on the chunking method, Fixed-size or Token-based. |
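To illustrate how chunk size and chunk overlap interact, here is a minimal sketch of fixed-size (character-based) chunking in Python. The function name `chunk_fixed_size` is hypothetical, not part of the component; it only demonstrates the sliding-window behavior described above, where each chunk is at most the defined size and adjacent chunks share the overlap.

```python
def chunk_fixed_size(text, chunk_size, chunk_overlap):
    """Split text into character chunks of at most chunk_size,
    with chunk_overlap characters shared between adjacent chunks.
    Illustrative sketch only, not the component's implementation."""
    if chunk_overlap >= chunk_size:
        raise ValueError("Chunk overlap must be less than the chunk size")
    # Each new chunk starts (chunk_size - chunk_overlap) characters
    # after the previous one, so the tail of one chunk repeats at the
    # head of the next.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the remaining text is fully covered
    return chunks

# Example: size 4, overlap 2 -> each chunk repeats the last 2
# characters of the previous chunk.
print(chunk_fixed_size("abcdefghij", 4, 2))
```

Note that only the final chunk may be shorter than the configured size, which matches the statement that chunks are "the exact defined size or less".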
Tokenizer | This option is available when Token-based is selected as the Chunking method. Select the model from the drop-down list. Hugging Face may take more time than the other models. |
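Token-based chunking works the same way as fixed-size chunking, except that sizes and overlaps are counted in tokens produced by the selected tokenizer. The sketch below uses plain whitespace splitting as a stand-in tokenizer; the component would instead use the model selected in the Tokenizer option, and the function name `chunk_token_based` is hypothetical.

```python
def chunk_token_based(text, chunk_size, chunk_overlap, tokenize=str.split):
    """Split text into chunks of at most chunk_size tokens, with
    chunk_overlap tokens shared between adjacent chunks.
    `tokenize` defaults to whitespace splitting as a stand-in for a
    real model tokenizer. Illustrative sketch only."""
    if chunk_overlap >= chunk_size:
        raise ValueError("Chunk overlap must be less than the chunk size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        # Rejoin the token window into a text chunk.
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # all tokens are covered
    return chunks

# Example: 3-token chunks with a 1-token overlap.
print(chunk_token_based("a b c d e f", 3, 1))
```

With a real subword tokenizer, the same token count can correspond to very different character lengths, which is why the Chunk size setting is interpreted in tokens rather than characters for this method.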
Advanced settings
tStatCatcher Statistics | Select this check box to gather the Job processing metadata at the Job level as well as at each component level. |
Usage
Usage rule | This component is used as an intermediate step. It requires input and output flows. |