About Talend Cloud Data Preparation
Talend Cloud Data Preparation is a self-service application that enables information workers to cut hours out of their work day by simplifying and expediting the laborious and time-consuming process of preparing data for analysis or other data-driven tasks.
This cloud version runs on top of Talend Cloud and delivers enterprise-class capabilities together with connectivity to virtually any data source. It fosters collaboration between business people who know the data best and central organizations, like IT or Risk Management, that define the rules and policies for data accessibility and governance.
It includes:
- Integration and cataloging.
- Data Discovery and Profiling.
- Cleansing, standardizing and shaping.
- Enriching and connecting datasets.
- Operationalizing Data Preparation.
Talend Data Preparation concepts
- Connection: Connections are the environments or systems where datasets are stored, such as databases, file systems, and distributed systems or platforms. The connection information for these systems only needs to be set up once, since it is reusable.
- Dataset: A dataset holds the raw data that can be used as the raw material for one or more preparations. It is presented as a table on which you can apply recipe steps without affecting the original data. A dataset can be reused across preparations.
- Sample: Your data will be visible in the form of a sample, retrieved from the dataset metadata.
- Preparation: A preparation is what links a dataset and a recipe together: it is the final outcome that you want to achieve with your data. You can export this outcome as a file or connect it to data targets. A preparation takes one dataset and applies a recipe to produce an outcome. The original dataset is never modified.
- Recipe: A recipe is literally defined as "a set of directions with a list of ingredients for making or preparing something". In Talend Cloud Data Preparation, the ingredients are the raw data, called datasets, and the directions are the set of functions applied to the dataset. Visually, the recipe is the top-down sequence of functions in the left collapsible panel. A recipe is linked to the dataset through a preparation. Every update to the recipe is automatically saved in the preparation.
- Function: A function is an action applied to a row, a column, or the whole dataset, such as removing empty rows. Because functions are applied as part of a preparation, they do not modify the original data. Applied functions are recorded, in sequence, in recipes.
- Semantic type: The semantic type of a column or record corresponds to the type of data that can be found in it, such as names, zip codes, phone numbers, coordinates, etc. The Talend Cloud applications all benefit from semantic awareness, meaning that when you look at your sample data, it will be automatically categorized using the default semantic types, or the ones that you have created yourself.
- Cloud Engine for Design: The Cloud Engine for Design is a built-in runner that allows users to easily process data without having to set up any processing engines. With this engine, you can run two objects in parallel. For advanced data processing, installing the secure Remote Engine Gen2 is recommended.
- Remote Engine Gen2: A Remote Engine Gen2 is a secure execution engine on which you can safely run objects. It allows you to have control over your execution environment and resources as you are able to create and configure the engine in your own environment (Virtual Private Cloud or on premises).
A Remote Engine ensures:
- Data processing in a safe and secure environment as Talend never has access to your data and resources.
- Optimal performance and security by increasing data locality instead of moving large amounts of data to the computation.
Relationship between connections, datasets, and preparations:
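The way a preparation ties a dataset to a recipe can also be pictured with a small Python sketch. This is only an illustrative model of the concepts defined above, not Talend code or a Talend API: the class names `Dataset`, `Recipe`, and `Preparation`, and the helper functions `remove_empty_rows` and `uppercase_column`, are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; none of this is a Talend API.
Row = Dict[str, str]
Function = Callable[[List[Row]], List[Row]]


@dataclass(frozen=True)
class Dataset:
    """Raw data presented as a table (here, a tuple of rows)."""
    name: str
    rows: Tuple[Row, ...]


@dataclass
class Recipe:
    """Ordered, top-down sequence of functions."""
    steps: List[Function] = field(default_factory=list)

    def add(self, step: Function) -> None:
        self.steps.append(step)


@dataclass
class Preparation:
    """Links one dataset to one recipe and produces an outcome."""
    dataset: Dataset
    recipe: Recipe

    def run(self) -> List[Row]:
        # Work on a copy so the original dataset is never modified.
        rows = [dict(row) for row in self.dataset.rows]
        for step in self.recipe.steps:
            rows = step(rows)
        return rows


def remove_empty_rows(rows: List[Row]) -> List[Row]:
    """Function applied to the whole dataset: drop rows with no values."""
    return [row for row in rows if any(v.strip() for v in row.values())]


def uppercase_column(column: str) -> Function:
    """Build a function applied to a single column."""
    def step(rows: List[Row]) -> List[Row]:
        return [{**row, column: row[column].upper()} for row in rows]
    return step


customers = Dataset("customers", (
    {"name": "ada", "city": "London"},
    {"name": "", "city": ""},
    {"name": "grace", "city": "New York"},
))

recipe = Recipe()
recipe.add(remove_empty_rows)
recipe.add(uppercase_column("name"))

result = Preparation(customers, recipe).run()
print(result)
```

Because `run` copies the rows before applying any step, the same `customers` dataset could be reused in a second preparation with a different recipe, which mirrors the statement that a dataset can be reused across preparations.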
Talend Cloud Data Preparation architecture
The diagram is divided into two main parts: the local network and the cloud infrastructure.
Local network
The local network includes a web browser, Talend Studio, a Remote Engine Gen2, and a Runtime Server.
- From your web browser, you can access Talend Cloud Data Preparation, Talend Dictionary Service, and Talend Management Console.
- From Talend Studio, you can benefit from the Talend Cloud Data Preparation features through the use of the tDatasetInput, tDatasetOutput, and tDataprepRun components. You can create datasets from various databases and export them to Talend Cloud Data Preparation, or leverage a preparation directly in a data integration Job or Spark Job.
- The Remote Engine Gen 1 is used to run the Jobs that use the Data Preparation components, and run artifacts and tasks on premises.
- The Remote Engine Gen2 is used to run objects from the Talend Cloud applications, such as preparations, as well as to create connections and fetch data samples.
Cloud infrastructure
The cloud infrastructure includes Talend Cloud Data Preparation that relies on the Dataset service, and the Cloud Engine for Design.
- The Dataset service is what provides the unified dataset list for Talend Cloud Data Preparation, Talend Cloud Data Inventory and Talend Cloud Pipeline Designer.
- In Talend Management Console, you can administer roles, users, projects, and licenses. You can create new users for the cloud applications and assign them to custom groups. You can then define roles and assign them to your users. Talend Management Console is also used to import your license files and create projects to collaborate on in Talend Studio. In addition, you can enable data and file transfer, data integration, and access to shared data sources for web users. You can, for example, import and use preconfigured sample Tasks, or design Tasks that automate the exchange and synchronization of data between applications.
- In Talend Cloud Data Preparation, you can import your data from local files or other sources, and cleanse or enrich it by creating new preparations.
- In Talend Dictionary Service, you can add, remove, or modify the semantic categories that are applied to each column in your data when opened in Talend Cloud Data Preparation.
- The Cloud Engine for Design is used to run artifacts, tasks, and preparations in the cloud, as well as to create connections and fetch data samples.