Defining data lineage with Atlas
If you are using Hortonworks Data Platform V2.4 or later to run your MapReduce or Spark Batch Jobs and Apache Atlas is installed in your Hortonworks cluster, you can use Atlas to trace the lineage of a given data flow and discover how this data was generated by a Job.
This lineage includes the components used in the Job and the schema changes between the components.
This type of Job is available only if you have subscribed to a Talend product with Big Data or to Talend Data Fabric.
- With Hortonworks Data Platform V2.4, Talend Studio supports Atlas 0.5 only.
- With Hortonworks Data Platform V2.5, Talend Studio supports Atlas 0.7 only.
Procedure
With this option activated, you need to set the following parameters:
- Atlas URL: enter the location of the Atlas instance to connect to. It is often http://name_of_your_atlas_node:port
- In the Username and Password fields, enter the authentication information for access to Atlas.
- Set Atlas configuration folder: if your Atlas cluster uses custom properties such as SSL or read timeout, select this check box and, in the field that is displayed, enter a directory on your local machine. Then place the atlas-application.properties file of your Atlas in this directory so that your Job can use these custom properties. You need to ask the administrator of your cluster for this configuration file. For further information about this file, see the Client Configs section in Atlas configuration.
- Die on error: select this check box to stop the Job execution when Atlas-related issues occur, such as connection issues to Atlas. Otherwise, leave it clear to allow your Job to continue to run.
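Before pointing the Set Atlas configuration folder field at a directory, it can help to check that the directory really contains atlas-application.properties and to see which custom properties it defines. The following is a minimal sketch in Python, not part of Talend Studio; the function name load_atlas_properties is hypothetical, and the parser only handles simple key=value lines.

```python
from pathlib import Path

# File name expected in the Atlas configuration folder (as stated in the text above).
PROPERTIES_FILE = "atlas-application.properties"


def load_atlas_properties(config_dir):
    """Return the key/value pairs found in atlas-application.properties.

    Hypothetical helper: raises FileNotFoundError if the file is missing
    from config_dir, so a misconfigured folder is detected early.
    """
    path = Path(config_dir) / PROPERTIES_FILE
    if not path.is_file():
        raise FileNotFoundError(f"{PROPERTIES_FILE} not found in {config_dir}")

    props = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props
```

Calling load_atlas_properties with the directory you entered in Studio returns a dictionary of the custom properties, which you can compare with what your cluster administrator provided.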
Results
When you run this Job, the lineage is automatically generated in Atlas.
Once the Job has finished, search Atlas for the lineage information written by this Job and read the lineage there.
The lineage information covers:
- the Job itself
- the components in the Job that use data schemas, such as tRowGenerator or tSortRow. Connection or configuration components such as tHDFSConfiguration are not taken into account because they do not use schemas.
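The search for lineage information can also be scripted against the Atlas REST interface instead of the Atlas web UI. The sketch below is an illustration only: the full-text search path is an assumption based on the Atlas 0.7-era REST API and may differ in your Atlas version, the host name atlas-node.example.com and the function names are hypothetical, and basic authentication is assumed to match the Username and Password set in the procedure.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumption: full-text search endpoint of the Atlas 0.7-era REST API.
# Check the REST documentation of your Atlas version before relying on it.
SEARCH_PATH = "/api/atlas/discovery/search/fulltext"


def build_search_url(atlas_url, query):
    """Build the search URL from the Atlas URL and a free-text query."""
    base = atlas_url.rstrip("/")
    return f"{base}{SEARCH_PATH}?{urllib.parse.urlencode({'query': query})}"


def search_atlas(atlas_url, query, username, password, timeout=10):
    """Send the search request with basic authentication and return parsed JSON."""
    request = urllib.request.Request(build_search_url(atlas_url, query))
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, search_atlas("http://atlas-node.example.com:21000", "name_of_your_job", "user", "password") would query a hypothetical Atlas node for entities matching the Job name; the structure of the returned JSON depends on the Atlas version.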