Managing datasets
You can manage the datasets included in Landing, Storage, Transform, Data mart, and Replication data tasks to create transformations, filter the data, and add columns.
The included datasets are listed under Datasets in the Design view. You can select which columns to display with the column picker.
Transformation rules and explicit transformations
You can perform both global and explicit transformations.
Transformation rules
You can perform global transformations by creating a transformation rule that uses % as a wildcard in the scope, applying the rule to all matching datasets.
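As a rough illustration (the dataset names are invented, and rule scopes are configured in the UI rather than written as SQL), the % wildcard matches any sequence of characters, much like the SQL LIKE operator:

```sql
-- Illustrative only: % matches any sequence of characters,
-- similar to its role in SQL LIKE.
-- Note that LIKE also treats _ as a single-character wildcard,
-- whereas only % is documented as a wildcard for rule scopes.
SELECT 'SALES_ORDERS' LIKE 'SALES%';  -- true: a scope of SALES% covers SALES_ORDERS
SELECT 'SALES_ITEMS'  LIKE 'SALES%';  -- true: ...and SALES_ITEMS
SELECT 'INVENTORY'    LIKE 'SALES%';  -- false: INVENTORY is not covered
SELECT 'INVENTORY'    LIKE '%';       -- true: a scope of % covers every dataset
```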
- Click Rules, and then Add rule to create a new transformation rule.
For more information, see Creating rules to transform datasets.
Transformation rules are indicated by a dark purple corner on the affected attribute.
Explicit transformations
Explicit transformations are created:
- When you use Edit to change a column attribute.
- When you use Rename on a dataset.
- When you add a column.
Explicit transformations override global transformations, and are indicated by a light purple corner on the affected attribute.
Dataset models
Datasets can be either source-based or target-based, depending on the task type and the operations in the task. The dataset model affects how the pipeline behaves when the source changes, and which operations you can perform.
- Source-based datasets
The dataset is based on the source datasets and holds only changes in metadata. A change in the source data is applied automatically, which may cause changes in all downstream tasks. It is not possible to change the column order or the source dataset.
The following task types always use a source-based dataset model: Landing, Storage, Registered data, Replication, and Landing in a data lake.
- Target-based datasets
The dataset is based on the target metadata. If a column is added to or removed from the source, the change is not automatically applied to the next downstream task. You can change the column order, and change the source dataset. This makes the task more self-contained, and lets you control the effect of source changes.
The following task types can use a target-based dataset model: Transform and Data mart. In some cases, a source-based model is used for Transform tasks, depending on the operation:
- If a SQL transformation or a transformation flow performs a column selection, the dataset will be target-based. For example, if you use SELECT A, B, C FROM XYZ in a SQL transformation, or use the Select columns processor in a transformation flow.
- If the default columns are maintained, the dataset is source-based. For example, if you use SELECT * FROM XYZ in a SQL transformation. Both cases are shown in the SQL sketch at the end of this list.
- Updating projects from a source-based model to a target-based model
Existing projects are updated to the target-based dataset model when applicable. You will be guided through the update process when you first open a project. There are some considerations when importing and exporting projects with different dataset models.
- It is not possible to import a project with a source-based model into a project with a target-based model.
Import the project with a source-based model into a new project, update the new project, and then export the resulting project. You can then re-import this project into the project with a target-based model.
- It is not possible to import a project with a target-based model into a project with a source-based model.
Update the project to a target-based model before importing a project with a target-based model.
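Returning to the Transform cases above, here is a minimal SQL sketch of the two forms (XYZ and the columns A, B, C are taken from the examples in the list; the table itself is hypothetical):

```sql
-- Explicit column selection: the resulting dataset is target-based.
SELECT A, B, C FROM XYZ;

-- Default column selection (all columns pass through unchanged):
-- the resulting dataset remains source-based.
SELECT * FROM XYZ;
```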
Filtering a dataset
You can filter data to create a subset of rows, if required.
- Click Filter.
For more information, see Filtering a dataset.
Renaming a dataset
You can rename a dataset.
- Click on a dataset, and then Rename.
Adding columns
You can add columns with row-level transformations, if required.
- Click Add column.
For more information, see Adding columns to a dataset.
Editing a column
You can edit column properties by selecting a column and clicking Edit.
- Name
- Key
Set a column to be a primary key. You can also set keys by selecting or deselecting in the Key column.
- Nullable
- Data type
Set the data type of the column. For some data types, you can set an additional property, for example, Length.
Understanding the impact of changing a data type
There are two common use cases for changing data type size or switching to a different data type:
- Ingesting data that will not fit into the current data type.
- A requirement for greater numerical accuracy. For example, changing SMALLINT to DECIMAL (p,s).
In most cases, changing a data type results in an ALTER TABLE operation, which preserves the data. For example, if the previous data type was STRING(25) and the new data type is STRING(50), the column is widened in place and the existing data is kept intact. However, in some cases, changing a data type results in the table being dropped and recreated. For example, if the previous column data type is NUMBER and you change it to DATE, the table is dropped and recreated, as converting numbers to dates is not possible. Likewise, if the target platform does not support ALTER TABLE operations (such as Databricks), the table is dropped and recreated.
There are some cases where it is theoretically possible to alter a table, but because of the underlying complexity, the data task will drop and recreate the table. Finally, there are cases where potential data loss rather than actual data loss will trigger a drop and recreate operation. For example, if you change STRING(25) to STRING(1), data loss will occur if the ingested data does not fit into STRING(1). However, it's possible that STRING(25) will always contain only one character, so in practice no data loss will occur, but the table will still be dropped and recreated due to potential data loss.
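As a minimal sketch of the two outcomes described above, assuming a hypothetical table orders (the actual DDL is generated by the data task, and the exact syntax varies by target platform):

```sql
-- Compatible change, for example STRING(25) to STRING(50): the column
-- can be widened in place, so the existing data is preserved.
ALTER TABLE orders ALTER COLUMN customer_name VARCHAR(50);

-- Incompatible change, for example NUMBER to DATE: no in-place
-- conversion is possible, so the table is dropped and recreated,
-- discarding the data it held.
DROP TABLE orders;
CREATE TABLE orders (
    order_id      INTEGER,
    order_date    DATE,
    customer_name VARCHAR(50)
);
```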
Changes to data types that require the table to be dropped and recreated, regardless of target platform
Changing the following data types will result in the table being dropped and recreated:
- BYTES
- BLOB
- CLOB
- NCLOB
Target platforms that support changes to data type size without dropping and recreating the table
When working with Snowflake, Google BigQuery, Amazon Redshift, Microsoft SQL Server, and Azure Synapse Analytics, you can change the size of certain data types without the table being dropped and recreated. The following table lists which data types are supported for each of the aforementioned platforms.
| Data type | Snowflake | Google BigQuery | Azure Synapse Analytics | Microsoft SQL Server | Amazon Redshift |
|---|---|---|---|---|---|
| INT1 | No | Yes | Yes | Yes | No |
| INT2 | No | Yes | Yes | Yes | No |
| INT4 | No | Yes | Yes | Yes | No |
| INT8 | No | Yes | Yes | Yes | No |
| REAL4 | No | No | Yes | Yes | No |
| REAL8 | No | No | Yes | Yes | No |
| UINT1 | No | Yes | Yes | Yes | No |
| UINT2 | No | Yes | Yes | Yes | No |
| UINT4 | No | Yes | Yes | Yes | No |
| UINT8 | No | Yes | Yes | Yes | No |
| NUMERIC | Yes | Yes | Yes | Yes | No |
| STRING | Yes | Yes | Yes | Yes | Yes |
| WSTRING | No | Yes | Yes | Yes | No |
Target platforms that support changing data types to STRING without the table being dropped and recreated
When moving data to Microsoft SQL Server and Azure Synapse Analytics, you can change the following data types to STRING, without the table being dropped and recreated:
- BOOLEAN
- DATE
- TIME
- DATETIME
- INT1
- INT2
- INT4
- INT8
- REAL4
- REAL8
- UINT1
- UINT2
- UINT4
- UINT8
- NUMERIC
- WSTRING (Supported with Azure Synapse Analytics only)
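For example, on Microsoft SQL Server such a change can be applied in place with ALTER TABLE; a hypothetical sketch (table and column names invented):

```sql
-- Hypothetical Microsoft SQL Server example: converting an integer
-- (INT4) column to a string type in place. The table is not dropped,
-- and existing values are converted to their string representation.
ALTER TABLE dbo.orders ALTER COLUMN quantity NVARCHAR(50);
```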
Removing columns
You can remove one or more columns from a dataset.
- Select the columns to remove, and click Remove.
If you want to see removed columns, click Show removed columns. Removed columns are indicated with strike-through text. You can retrieve a removed column by selecting it, and clicking Revert.
Reverting explicit changes to columns
You can revert all explicit changes to one or more columns.
- Select the columns to revert changes to, and click Revert.
Changes from global transformation rules will not be reverted.
If you revert an added column, it will be removed.
Dataset settings
You can change settings for the dataset. The default setting is to inherit the setting of the data asset, but you can also change a setting to be explicitly On or Off.
- Click on a dataset, and then Settings.
Viewing data
You can view a sample of the data to see and validate the shape of your data as you are designing your data pipeline.
The following requirements must be met:
- Viewing data is enabled at the tenant level in Administration.
Enable Settings > Feature control > Viewing data in Data Integration.
- You are assigned the Can view data role in the space where the connection resides.
- You are assigned the Can view role in the space where the project resides.
To view sample data in the Datasets tab in the Design view:
- Click View data in Physical objects.
A sample of the data is displayed. You can set how many data rows to include in the sample with Number of rows.
To change between datasets and tables:
- Select Datasets to view the logical representation of the data.
- Select Physical objects to view the physical representation in the database as tables and views.
Note: This option is not available if the physical representation has not been created yet.
You can filter the sample data in two ways:
- Use the filter to select which sample data to retrieve.
For example, if you use the filter ${OrderYear}>2023 and Number of rows is set to 10, you will get a sample of 10 orders from 2024.
- Filter the sample data by a specific column.
This will only affect the existing sample data. If you used the retrieval filter to include only orders from 2024, and then set the column filter to show orders from 2022, the result is an empty sample.
You can also sort the data sample by a specific column. Sorting only affects the existing sample data. If you used the retrieval filter to include only orders from 2024 and invert the sort order, the sample data will still only contain orders from 2024.
You can hide columns in the data view:
- Hide a single column by clicking on the column, and then Hide column.
- Hide several columns by clicking on any column, and then Display columns. This lets you control visibility for all columns in the view.
Validating and adjusting the datasets
You can validate all datasets that are included in the data task.
Expand Validate and adjust to see all validation errors and design changes.
Validating the datasets
- Click Validate datasets to validate the datasets.
Validation includes checking that:
- All tables have a primary key.
- There are no missing attributes.
- There are no duplicate table or column names.
You will also get a list of design changes compared to the source:
- Added tables and columns
- Dropped tables and columns
- Renamed tables and columns
- Changed primary keys and data types
- Fix the validation errors, and then validate the datasets again.
- Most design changes can be adjusted automatically, except changed primary keys or data types. In this case, you need to sync the datasets.
Preparing the datasets
You can prepare datasets to apply design changes, without data loss when possible. If there are design changes that cannot be applied without data loss, you will get the option to recreate the tables from the source, with data loss.
This requires stopping the task.
- Click ..., and then Prepare.
When the datasets are prepared, validate the datasets before restarting the task.
Recreating datasets
You can recreate the datasets from the source. When you recreate a dataset, there will be data loss. As long as you have the source data, you can reload it from the source.
This requires stopping the task.
- Click ..., and then Recreate.
Limitations
- In Google BigQuery, if you delete or rename a column, the table will be recreated, leading to data loss.