
tChunking Standard Properties

Availability note: Beta

These properties are used to configure tChunking running in the Standard Job framework.

The Standard tChunking component belongs to the AI family.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. When you create a Spark Job, avoid the reserved word line when naming the fields.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether to propagate the changes to all the Jobs upon completion.

    If you just want to propagate the changes to the current Job, you can select No upon completion and choose this schema metadata again in the Repository Content window.

 

Built-In: You create and store the schema locally for this component only.

 

Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.

Split column

Select the column that must be broken down into chunks.

Chunking method

Select the method from the drop-down list:
  • Fixed-size: Each chunk contains the same number of characters.
  • Token-based: Each chunk contains the same number of tokens.

The chunks will be the exact defined size or less.

Chunk size

Enter the maximum number of characters or tokens that a chunk can contain. The unit depends on the Chunking method:
  • Fixed-size: The size is counted in characters. For example, How are you? equals 12 characters.
  • Token-based: The size is counted in tokens. For example, How are you? equals 4 tokens: How, are, you and ? are 1 token each.
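
The difference between the two units can be illustrated with a short Python sketch. It is not the component's implementation: the helper names and the regular expression are assumptions made for illustration, and the tokenizers offered by tChunking are model-specific and may split text differently.

```python
import re

# Hypothetical tokenizer for illustration only: each word and each
# punctuation mark counts as one token. Real tokenizers are model-specific.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def count_characters(text):
    """Size of the text as counted by a character-based method."""
    return len(text)

def simple_tokens(text):
    """Split the text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "How are you?"
print(count_characters(sample))  # 12
print(simple_tokens(sample))     # ['How', 'are', 'you', '?']
```

With a Chunk size of 12, this sample fits in a single character-based chunk, while a token-based chunk of the same size could hold three times as many tokens.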
Chunk overlap

Enter the number of characters or tokens that should overlap between two adjacent chunks. This number must be less than the Chunk size.

Overlapping ensures continuity and context between chunks, and prevents the segmentation from disrupting the flow and coherence of the text.

The overlap is based on the chunking method, Fixed-size or Token-based.
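
As a rough illustration of how Chunk size and Chunk overlap interact, the following Python sketch uses a sliding window. The function name and the exact windowing strategy are assumptions, not the algorithm tChunking is documented to use. The same function works on a string (character units) or on a list of tokens.

```python
def chunk_with_overlap(units, chunk_size, overlap):
    """Split a sequence into windows of at most chunk_size items.

    Each window starts with the last `overlap` items of the previous one.
    `units` can be a string (characters) or a list (tokens).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and less than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        # Stop once the window has reached the end of the input.
        if start + chunk_size >= len(units):
            break
    return chunks

# Character units: chunk size 5, overlap 2
print(chunk_with_overlap("How are you?", 5, 2))
# ['How a', ' are ', 'e you', 'ou?']

# Token units: chunk size 3, overlap 1
print(chunk_with_overlap(["How", "are", "you", "?"], 3, 1))
# [['How', 'are', 'you'], ['you', '?']]
```

In this sketch the last chunk is shorter than the others, and an overlap equal to or greater than the chunk size is rejected because the window would never advance.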

Tokenizer

This option is available when Token-based is selected as the Chunking method.

Select the model from the drop-down list.

The Hugging Face option may take more time to process than the other models.

Advanced settings

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level as well as at each component level.

Usage

Usage rule

This component is used as an intermediate step. It requires an input flow and an output flow.
