Data Flow Lineage Trace
This method of analysis presents either a graphical or a textual representation of the flow of data through connection definitions to data stores and the physical transformation rules which transform and move the data. In order to see data flow lineage, one must:
- Define a configuration that contains all of the models potentially in the data flow
- Stitch the models together by resolving connection definitions and Build the configuration
Once the configuration is built, you are ready to report on lineage.
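As a rough illustration of these two prerequisites, the sketch below models a configuration as a set of models plus a table that resolves each connection definition to one of those models. All names here (`models`, `connections`, `stitching` and the sample values) are hypothetical and do not reflect the product's internal format. The only point is that end-to-end lineage needs every connection to resolve to a model in the configuration.

```python
def unresolved_connections(connections, stitching, models):
    """Return connection names not yet resolved to a model in the configuration."""
    return sorted(
        name for name in connections
        if stitching.get(name) not in models
    )


# Hypothetical configuration: three models and two connection definitions.
models = {"Accounting", "Staging DW", "ETL job"}
connections = ["ar_source", "staging_target"]
stitching = {"ar_source": "Accounting"}  # staging_target is not stitched yet

print(unresolved_connections(connections, stitching, models))  # ['staging_target']
```

An empty result would mean that every connection is stitched and the configuration can be built.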
In the Data Lineage Diagram, all columns/fields of a given table/file are presented at once, which matches classic data modeling concepts. Selecting a given column/field highlights the data flow to it.
End-to-end data flow lineage across models is only available at the classifier (e.g., table) and feature (e.g., column) level. If instead one goes to the object page for a schema or model, which is neither a classifier nor a feature, the Data Flow tab shows the overview lineage within the scope of that model only.
A data flow lineage trace presents summary lineage, as opposed to the data flow overview lineage, which presents a step-by-step transformation lineage.
When you trace the impact/lineage of a table or column, you do not see all the transformations. Instead, you see a summary of the whole job (a picture much closer to that of an architecture diagram). However, you are also able to see complete end-to-end lineage (not confined to one DI or BI model).
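The difference between the step-by-step overview and the summary trace can be pictured as collapsing the intermediate process steps of a job into a single link between data stores. The following is a minimal sketch under that assumption. The function name and the sample step names are hypothetical, and it does not describe how the tool computes its summary.

```python
def summarize(edges, processes):
    """Collapse process nodes so that only store-to-store links remain."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    def next_stores(node, seen):
        # Walk forward through process steps until data stores are reached.
        found = set()
        for nxt in successors.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in processes:
                found |= next_stores(nxt, seen)
            else:
                found.add(nxt)
        return found

    summary = set()
    for src in successors:
        if src not in processes:
            summary.update((src, dst) for dst in next_stores(src, {src}))
    return sorted(summary)


# Hypothetical step-by-step job: one source table, three steps, one target table.
detailed = [
    ("AR.Customer", "read"),
    ("read", "filter"),
    ("filter", "write"),
    ("write", "Staging.Customer"),
]
print(summarize(detailed, {"read", "filter", "write"}))
# [('AR.Customer', 'Staging.Customer')]
```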
Finally, the tool does not display constants on the lineage diagram. In particular, this means that if a constant appears as a source for lineage and that process only has that constant as a source for a lineage trace, you will not see that process in the lineage trace.
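The rule about constants can be expressed as a simple filter: a process stays in the trace only if at least one of its sources is not a constant. This is a sketch of that rule only. The data layout and the names `process_sources` and `constants` are assumptions made for illustration.

```python
def visible_processes(process_sources, constants):
    """Keep only processes that have at least one non-constant source."""
    return sorted(
        process
        for process, sources in process_sources.items()
        if any(source not in constants for source in sources)
    )


# Hypothetical processes and their sources; quoted values stand for constants.
process_sources = {
    "load_customer": ["AR.Customer"],
    "set_flag": ["'Y'"],
    "mixed": ["AR.Customer", "'N'"],
}
constants = {"'Y'", "'N'"}

print(visible_processes(process_sources, constants))  # ['load_customer', 'mixed']
```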
Steps
- Sign in as a user who has at least the Metadata Viewing or Data Management capability object role assignment to the configuration and all its contained models.
Without the Metadata Viewing capability object role assignment to all the configuration’s contained models, you will see a dialog indicating that you do not have sufficient privileges.
- Find a starting point for lineage by either
- Navigating to that element’s object page and selecting the Data Flow tab
- Or, in a list of elements, clicking the line the element is on and then clicking the appropriate Trace Data Flow icon
- Or, right-clicking the element in a diagram (architecture diagram, lineage diagram or model diagram) and selecting Trace Lineage > Data Flow
- From here you may
- Use the common lineage trace functions
- Specify the lineage presentation
- Switch to
- Data Lineage - trace from an object upstream to objects that provide data flow to that object
- Data Impact - trace from an object downstream to objects that are impacted via data flow by that object
- Full Data Lineage - Both of the above.
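The three trace types listed above correspond to walking a directed graph of data flow links backward, forward, or in both directions from the starting object. The sketch below illustrates this with a plain edge list. The function name, the direction keywords and the sample object names are hypothetical.

```python
from collections import defaultdict, deque


def trace(edges, start, direction):
    """direction: 'lineage' (upstream), 'impact' (downstream) or 'full' (both)."""
    forward, backward = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)

    def reach(neighbors):
        # Breadth-first walk collecting every object reachable from start.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt != start and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    if direction == "lineage":
        return reach(backward)
    if direction == "impact":
        return reach(forward)
    if direction == "full":
        return reach(backward) | reach(forward)
    raise ValueError(f"unknown direction: {direction}")


edges = [
    ("source.orders", "staging.orders"),
    ("staging.orders", "dw.orders"),
    ("dw.orders", "report.sales"),
]
print(sorted(trace(edges, "staging.orders", "lineage")))  # ['source.orders']
print(sorted(trace(edges, "staging.orders", "impact")))   # ['dw.orders', 'report.sales']
```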
Data Flow Tree
Select the Tree tab on the left to obtain this presentation.
Next to SHOW, you will see a list of objects or processes:
- Objects: data store object types, e.g., tables, columns, views, fields, files, etc.
- Processes: data movement and possibly transformation processes, e.g., mappings, transformations, computations, select/inserts, etc.
The scope of that list is based upon the choice of direction of the trace, which is impact (forward), lineage (sources), or the business intelligence (BI) reports, as well as the proximity in the trace:
- Adjacent: objects/processes in the lineage which are the next items in a lineage trace. For impact, that is often the data store (like a warehouse) loaded by DI/ETL from the object that is the focus of the lineage. For source lineage, it is often the data source directly loaded to produce the object that is the focus of the lineage.
- Ultimate: end objects/processes, i.e., the final nodes in the lineage where the trace stops. For impact, this often means report fields; for source lineage, it often means operational system tables and columns.
- Reports: objects/processes in the lineage which are part of business intelligence type reports, generally at the far end of the lineage trace.
- All: all objects/processes in the lineage trace.
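The Adjacent and Ultimate choices described above can be sketched on a small graph: adjacent items are one link away from the focus object, while ultimate items are the reachable nodes that have nothing further in the chosen direction. The function names and sample objects are hypothetical, and this is an illustration rather than the tool's algorithm.

```python
def adjacent(edges, focus, direction):
    """Objects exactly one link away, upstream ('lineage') or downstream ('impact')."""
    if direction == "lineage":
        return {src for src, dst in edges if dst == focus}
    if direction == "impact":
        return {dst for src, dst in edges if src == focus}
    raise ValueError(f"unknown direction: {direction}")


def ultimate(edges, focus, direction):
    """Reachable objects that have nothing further in the chosen direction."""
    frontier, seen = [focus], set()
    while frontier:
        node = frontier.pop()
        for nxt in adjacent(edges, node, direction):
            if nxt != focus and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return {node for node in seen if not adjacent(edges, node, direction)}


edges = [
    ("ops.orders", "staging.orders"),
    ("staging.orders", "dw.orders"),
    ("dw.orders", "report.sales"),
]
print(adjacent(edges, "dw.orders", "lineage"))  # {'staging.orders'}
print(ultimate(edges, "dw.orders", "lineage"))  # {'ops.orders'}
```

When the focus object is only one step away from its sources, both calls return the same set, which is the situation shown in the example that follows.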
Steps
- Trace data flow lineage.
- Click the Tree tab on the left.
- From here you may
- Pick the options next to SHOW in the upper left, as defined above.
- Click the Download icon to download the entire textual results to CSV format.
- Expand the details panel to see an equivalent of the Overview tab for the object page of a selected object or process.
Example
Data Flow Tree Objects
Search for the DW Staging.Customer table, go to the object page and then the Data Flow tab. Click the Tree tab on the left. Click Objects and Ultimate next to SHOW.
The Lineage (Sources) panel shows the Customer table in the Accounting.MITI-Finance-AR datastore along with the two files in the Data Lake, which together comprise the ultimate sources for this Customer table in Staging DW.
The Impact (Destinations) panel shows the ultimate reports using data from the Customer table.
Click Adjacent.
The Lineage (Sources) panel still shows the Customer table in the Accounts Receivable model as it was not only the ultimate source for this table in Staging DW, but also was the adjacent one.
The Impact (Destinations) panel shows the tables in the Dimensional DW data store, instead of going to the ultimate destination, which were the reports.
Now, click the Diagram tab on the right to see the full picture of the lineage.
Now one can see why the results on the Lineage (Sources) panel are similar: there is really only one step (adjacent) to the ultimate sources.
This example is a fairly simple demo. One can imagine the value of using the Tree tab for more realistic (and thus much more complex) lineage examples from real environments.
Return to the Tree tab and click Ultimate.
Expand the Details panel on the far right and select the Finance1 app in the Qlik Sense Cloud model.
Now we see a representation of the contents of the Overview tab of the object page, but presented as a panel in the lineage display.
You may now click Open in Tool, as in the examples with BI tools further on in the user guide.
Data Flow Tree Processes
Now click Processes.
There are four processes immediately before the table in the data flow and one process immediately after.
Click the first item in the Processes (Sources) list, which is named Mapping.
This precursor process is actually a Talend DI process reading from the accounts receivable operational data store and writing to the Staging DW (for which we were looking at the lineage).
Go to the Data Flow tab for this process to produce an overview lineage diagram for that process:
This process includes a number of parallel pipelines to various tables in Staging DW, including the Customer table.
As it is a data flow overview diagram (not a lineage trace), there are several pipelines shown, but the scope is just within the DI/ETL model.
Click the Back arrow in the browser to return to the original Tree based lineage trace.
Data Flow Diagram
The Classic Data Lineage Diagram can be overly crowded in today's data lake architectures, where it is common to find tables/files with over a hundred columns/fields. Furthermore, the large number of tables/files involved may generate too many objects for a readable graph, giving rise to a possible warning in the user interface.
The data flow "interactive" Analysis Diagram displays only the columns/fields involved in the given data flow trace, not all the columns. The user can then select the columns/fields to be displayed to better present the business use case of that data flow, and can interact within the diagram by selecting columns/fields to display their lineage. Furthermore, the Analysis Diagrams allow you to display conditional labels such as PII or Confidential SensitivityLevel, not only providing more critical information to the user, but also better visualizing the propagation of that information (e.g. PII) through the data flow lineage trace.
The data flow analysis diagramming feature presents a graphical representation of the flow of data through connection definitions to data stores and the physical transformation rules which transform and move the data. In order to see data flow lineage, one must:
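To picture what following a label such as PII through the trace means, the sketch below marks every column reachable downstream from a labeled column. It is a simplified illustration with hypothetical column names. It does not describe how conditional labels are defined or evaluated in the product.

```python
from collections import defaultdict, deque


def propagate_label(edges, labeled):
    """Return the labeled columns plus every column downstream of them."""
    forward = defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)

    reached = set(labeled)
    queue = deque(labeled)
    while queue:
        column = queue.popleft()
        for nxt in forward.get(column, ()):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


# Hypothetical column-level links; only the ssn column is labeled PII.
edges = [
    ("crm.customer.ssn", "staging.customer.ssn"),
    ("staging.customer.ssn", "dw.customer.tax_id"),
    ("crm.customer.city", "dw.customer.city"),
]
print(sorted(propagate_label(edges, {"crm.customer.ssn"})))
# ['crm.customer.ssn', 'dw.customer.tax_id', 'staging.customer.ssn']
```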
- Define a configuration that contains all of the models potentially in the data flow
- Stitch the models together by resolving connection definitions and Build the configuration
Once the configuration is built, you are ready to report on lineage.
End-to-end data flow lineage across models is only available at the classifier (e.g., table) and feature (e.g., column) level. If instead one goes to the object page for a schema or model, which is neither a classifier nor a feature, the Data Flow tab shows the overview lineage within the scope of that model only.
Steps
- Trace data flow lineage.
- Click the Analysis Diagram tab on the left side.
- From here you may
- Pick the Direction in the pull-down in the header of the diagram:
- Impact (Destination) direction
- Lineage (Sources) direction
- Any type for both data impact and lineage.
- Select which columns to display in the diagram using the Columns pull-down in the header of the diagram.
- A list of possible columns with a quick find is presented with checkboxes.
- Pick the Depth in the pull-down in the upper right.
- 1 (Adjacent) step in the lineage. Objects in the lineage that are the next items in a lineage trace.
For impact, adjacent is often the data store (like a warehouse) loaded by DI/ETL from the object that is the focus of the lineage. For source lineage, it is often the data source directly loaded to produce the object that is the focus of the lineage.
- 2 through 9 steps in the lineage
- Click the Show actions for the selected object icon and
- Select Show/Hide Columns to show columns in the selected object, or all objects if none is selected.
- Select Expand/Collapse All to expand the display of the selected object down to the current display level (columns or tables) or collapse to the highest level. Applies to all objects if none is selected.
- Click Save an image to produce a downloadable file with a lineage image.
- Click Filters and specify lineage filter options.
- Click Display Options and specify lineage display options.
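The Depth option in the steps above limits how many steps away from the focus object the diagram extends. A depth-limited walk can be sketched as follows. The function and sample names are hypothetical, only the downstream (impact) direction is shown, and treating `depth=None` as "no limit" is an assumption made for this illustration.

```python
from collections import defaultdict


def trace_to_depth(edges, start, depth=None):
    """Map each downstream object to its distance in steps, up to `depth`."""
    forward = defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)

    steps = {}
    frontier, level = {start}, 0
    while frontier and (depth is None or level < depth):
        level += 1
        next_frontier = set()
        for node in frontier:
            for nxt in forward.get(node, ()):
                if nxt != start and nxt not in steps:
                    steps[nxt] = level
                    next_frontier.add(nxt)
        frontier = next_frontier
    return steps


edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(trace_to_depth(edges, "a", depth=1))  # {'b': 1}
print(trace_to_depth(edges, "a", depth=2))  # {'b': 1, 'c': 2}
```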
Example
Navigate to the object page for the Customer table in the Staging DW.dbo schema.
Go to the Data Flow tab and click the Diagram tab on the left side.
Pick ANY for the Direction in the pull-down in the diagram header.
The red pin indicates the source of the lineage and impact trace.
The diagram defaults to the classifier (table) level for performance reasons.
Click the Show actions for the selected object icon and select Show Columns.
Now columns are visible, but still not the column lines. Again, this is for reasons of performance and simplicity of presentation.
Click on the Display Options icon and click Show Conditional Labels.
Here, you may pick and choose conditional labels to show in the diagram and the image shows all of them selected for display.
Click on the Display Options icon and select Show Term Definitions.
Terms, like US Social Security Number (documenting the ID field), are used to document columns and tables that are in this lineage trace and this is shown in the diagram.
Data Lineage Diagram Display Options
You may control the display of lineage objects and their presentation using the lineage Display Options menu.
Here you may see the terms with Defined by relationships.
Show/Hide Columns
Click on the Display Options icon and click Show Conditional Labels.
Here, you may pick and choose conditional labels to show in the diagram and the image shows all of them selected for display.
Show Mixed Connections
See Show Mixed Connections in the Classic Data Lineage Diagram.
Maximum Node Width
See Maximum Node Width in the Data Lineage Diagram.
Lineage Diagram Trace in General
Select the Analysis Diagram tab on the left to obtain this presentation. You will see a graphical presentation of the lineage (data impact or data source).
Additional options include:
Overview
You may click this icon to show or hide an Overview panel of the lineage trace diagram. Click in the overview to quickly move to a portion of the full diagram.
Zoom In/Out and Fit to content
Click the Zoom in or Zoom out icons to adjust the magnification of the diagram. You may also click the Fit to content icon to view the entire diagram at the best zoom that will fit.
Collapse / Expand
Click Expand / Collapse to expand or collapse the entire diagram (ensure that you do not have an object selected, otherwise the action will only apply to that object).
You may also click on the plus sign for an object to expand and the minus sign to collapse just that object.
Open the object page
You may right-click and select Open to navigate to the object page.
You may download a PNG or SVG image of the diagram.
Quick find
In the upper right, there is a search text box that provides a quick list of object names that contain the text you type. You may click on any of the results to select that object in the diagram and move the focus there.
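The quick find behaves like a "contains" filter over the object names in the diagram. A minimal sketch of such a filter is shown below. The function name and sample names are hypothetical, and the case-insensitive matching is an assumption rather than documented behavior.

```python
def quick_find(names, text):
    """Return the names that contain the typed text (case-insensitive by assumption)."""
    needle = text.casefold()
    return [name for name in names if needle in name.casefold()]


names = ["Customer", "CustomerInvoice", "Invoice"]
print(quick_find(names, "cust"))  # ['Customer', 'CustomerInvoice']
```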
Explore Further
Invoking a lineage trace from any reference to an object
You may invoke a lineage trace from any diagram or any list of results (e.g., from a Browse or Search) via the right-click context menu.
Interpreting the graphical lineage
In general, the lineage tools within Talend Data Catalog function identically whether one is analyzing data flow lineage, semantic lineage or both. However, the presentation is different, as follows:
In addition, Talend Data Catalog has four levels of presentation:
- Configuration Model Connections Overview – which is a diagram representing the various Models contained within a configuration and how they are related (or stitched) to each other based upon connection definitions manually assigned to Talend Data Catalog.
- Model Connections Overview – which is a diagram representing the various Models contained within the directory of an external repository and how they are related (or stitched) to each other based upon connection definitions already provided in the external metadata repository.
- Model Lineage Overview – which is a diagram representing an overview of the lineage within a given Model.
- Lineage Trace analysis at the configuration or Model level – which is a fully detailed trace of semantic and/or data flow lineage for detailed analysis.
Properties Panel
Click to select an object and view its properties in the Properties Panel on the right. You may show and hide this panel as needed.
Data Flow Classic Diagram
This method of analysis presents a graphical representation of the flow of data through connection definitions to data stores and physical transformation rules which transform and move the data. To see data flow lineage, one must
- Define a configuration that contains all of the models potentially in the data flow
- Stitch the models together by resolving connection definitions and Build the configuration
Once the configuration is built, you are ready to report on lineage.
End-to-end data flow lineage across models is only available at the classifier (e.g., table) and feature (e.g., column) level. If instead one goes to the object page for a schema or model, which is neither a classifier nor a feature, the Data Flow tab shows the overview lineage within the scope of that model only.
This is an older methodology for presenting a lineage trace. You are highly encouraged to use the newer method, as the Classic diagram does not scale well to larger diagrams with many objects.
You may disable this feature in the UI by setting the Show Lineage Classic Diagram group preference to false for the group Everyone.
These are the analysis type use cases, generally posed as questions such as:
- Given an item on a report, what data entry system fields impact these results?
- Why are the numbers on this report the way that they are?
- How should the system data be changed to get the correct results for this report?
This type of analysis, i.e., asking where the information comes from, is a question posed “upstream” in the data flow. We refer to it as a reverse lineage question. When consumers of these reports ask these questions, a correct and responsive answer may be the most valuable information provided by a metadata management environment.
Steps
- Trace data flow lineage.
- Click the Diagram tab on the left.
- From here you may
- Pick the Type in the pull-down in the upper right.
- Data Impact type
- Data Lineage type
- Full Data Lineage type for both data impact and lineage.
- Click the More Options icon and
- select Show/Hide Columns to show columns in the selected object, or all objects if none is selected
- select Expand/Collapse All to expand down to the current display level (columns or tables) or collapse to the highest level.
- Click Save an image to produce a downloadable file with a lineage image
- Click Edit Filters and specify lineage filter options.
- Click Display Options and specify lineage display options.
Example
Search for the Net Vendor CustomerInvoices Tableau worksheet and open it.
Go to the Data Flow tab.
This is a business intelligence report and thus is at the end of the lineage, so Talend Data Catalog automatically chooses Data Lineage for lineage Type.
The End Objects tab on the left is selected in this case, so we see the textual tree-based report.
Click Collapse all to reduce the tree to the top five elements in the lineage.
Now, click the Diagram tab on the left. Click the Collapse Selected node completely icon.
The different lineage lines indicate different types of data flow processes.
Click the plus sign next to MITI-Finance-AP.dbo (Database) in Accounting (Model).
Click the plus sign next to Invoice (Table) in MITI-Finance-AP.dbo (Database).
You then see the exact column that is a source in the lineage trace.
Click in an empty space in the diagram to de-select Invoice, then select the To Column level expansion, which will now apply to all objects.
Select a column, then click Highlight to outline the paths through that object.
Click the black line between Adjustments.Adj.TransAmt and Staging DW.dbo.GLAccount.AccountAmountAvailable.
You then see the transformation at the bottom of the page.
You may also simply pass the pointer over a link and see summary information.
Many times, one may ask these forward lineage or impact analysis type of questions:
- If I make a change to this field, what reports will be impacted?
- How is this identity information merged with the personnel system information on these other reports?
A data flow impact report traces the manner in which data flows from source to destination.
Steps
- Trace data flow lineage.
- Click Data Impact in the Type pull-down in the upper right.
- From here you may
- Pick the Type in the pull-down in the upper right.
- Data Impact type
- Data Lineage type
- Full Data Lineage type for both data impact and lineage.
- Click the More Options icon and
- select Show/Hide Columns to show columns in the selected object, or all objects if none is selected
- select Expand/Collapse All to expand down to the current display level (columns or tables) or collapse to the highest level.
- Click Save an image to produce a downloadable file with a lineage image
- Click Edit Filters and specify lineage filter options.
- Click Display Options and specify lineage display options.
Example
Navigate to the object page for the file PAYTRANS.csv. When searching for it, enclose the search string in quotation marks (e.g. "PAYTRANS.csv"), as the period (.) has special meaning in the search syntax, and disable the semantic search.
Then click the Data Flow tab and Diagram tab on the left. Note, the Impact type is automatically selected, as the PAYTRANS.csv file is an ultimate source in the configuration, so it does not have any source lineage.
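The quoting needed when searching for a name such as PAYTRANS.csv can be captured in a small helper. This is a hypothetical convenience function, not part of the product. It handles only the period mentioned in this example and makes no claim about other special characters in the search syntax.

```python
def quote_for_search(name):
    """Wrap the name in double quotes when it contains a period."""
    already_quoted = len(name) >= 2 and name.startswith('"') and name.endswith('"')
    if "." in name and not already_quoted:
        return f'"{name}"'
    return name


print(quote_for_search("PAYTRANS.csv"))  # "PAYTRANS.csv"
print(quote_for_search("Customer"))      # Customer
```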
This option provides the combination of both:
- Data Lineage (trace from an object upstream to objects that provide data flow to that object)
- Data Impact (trace from an object downstream to objects that are impacted via data flow by that object)
Both are based upon all the lineage flows that trace through the selected object (feature or classifier).
Steps
- Trace data flow lineage.
- Click Full Data Lineage in the Type pull-down in the upper right.
- From here you may
- Pick the Type in the pull-down in the upper right.
- Data Impact type
- Data Lineage type
- Full Data Lineage type for both data impact and lineage.
- Click the More Options icon and
- select Show/Hide Columns to show columns in the selected object, or all objects if none is selected
- select Expand/Collapse All to expand down to the current display level (columns or tables) or collapse to the highest level.
- Click Save an image to produce a downloadable file with a lineage image
- Click Edit Filters and specify lineage filter options.
- Click Display Options and specify lineage display options.
Example
The Full Data Lineage option is the default. However, as it may take more time to render, you may disable it in the Group Preferences.
If it is disabled, you may enable it again. Sign in as Administrator. Go to MANAGE > Groups. Select the group named Everyone. Go to the Preferences tab, click Add and specify the Enable Full Data Lineage preference.
Click OK. Set the Value to true and click SAVE.
Search for “Customer” and pick the table Dimensional DW > dbo > Customer.
Go to the Data Flow tab.
The Data Flow tab has double arrows next to it, indicating that there are both impact and lineage traces for this object.
Select Full Data Lineage.
You have all the lineage traces going through that object. The object from which the lineage is determined is marked with a red pin.
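The automatic choice of trace type seen in these examples (Data Lineage for a report at the end of the lineage, Data Impact for an ultimate source file, and both directions for an intermediate table) follows from whether the object has sources, destinations, or both. The sketch below expresses that rule. The function name and the sample objects are hypothetical, and returning `None` for an object with no links is an assumption.

```python
def default_trace_type(edges, obj):
    """Pick a trace type from the directions in which the object has links."""
    has_sources = any(dst == obj for _, dst in edges)
    has_destinations = any(src == obj for src, _ in edges)
    if has_sources and has_destinations:
        return "Full Data Lineage"
    if has_sources:
        return "Data Lineage"
    if has_destinations:
        return "Data Impact"
    return None  # assumption: nothing to trace for an unconnected object


edges = [("file", "table"), ("table", "report")]
print(default_trace_type(edges, "report"))  # Data Lineage
print(default_trace_type(edges, "file"))    # Data Impact
print(default_trace_type(edges, "table"))   # Full Data Lineage
```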
Data Lineage Diagram Display Options
You may control the display of lineage objects and their presentation using the lineage Display Options menu.
Sometimes an ETL/DI process will read from one table and write to another table in the same database. In this case, the lineage will often show process arrows that loop back, because the normal presentation is to group tables inside their respective schemas. However, by default, the lineage in the Data Flow tab will attempt to produce a continuous diagram from left to right through the lineage by breaking up these tables, giving a more understandable lineage picture.
When you click this checkbox, the lineage is returned to the mode where all tables are grouped into their respective schemas and thus loops are shown.
Steps
The default setting where this option is not checked is presented.
- Check Display Options > Show Mixed Connections to show these objects
The lineage is returned to the mode where all tables are grouped into their respective schemas and thus loops are shown.
- You may uncheck to return to the default.
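The loops described above arise when table-level links are rolled up to the schema that contains the tables: a link between two tables of the same schema becomes a link from the schema to itself. The following sketch shows that effect on hypothetical table names of the form `database.schema.table`. It illustrates the grouping only, not the tool's layout algorithm.

```python
def schema_of(table):
    """Strip the final name segment, e.g. 'DW.dbo.Customer' -> 'DW.dbo'."""
    return table.rsplit(".", 1)[0]


def schema_level_edges(table_edges):
    """Roll table-to-table links up to their containing schemas."""
    return sorted({(schema_of(src), schema_of(dst)) for src, dst in table_edges})


def loops(table_edges):
    """Schema-level links that start and end in the same schema."""
    return [edge for edge in schema_level_edges(table_edges) if edge[0] == edge[1]]


table_edges = [
    ("Src.dbo.Customer", "DW.dbo.Stage_Customer"),
    ("DW.dbo.Stage_Customer", "DW.dbo.Customer"),
]
print(schema_level_edges(table_edges))  # [('DW.dbo', 'DW.dbo'), ('Src.dbo', 'DW.dbo')]
print(loops(table_edges))               # [('DW.dbo', 'DW.dbo')]
```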
In many cases, names of objects may be too long to fit into the objects in the diagram. You may specify several different node width maximums to make the diagram more readable. Click on Display Options.
Highlight Control Links
Checking this option means that anytime you are highlighting a trace, the control links will be included in the highlighting.
Lineage Diagram Trace in General
Select the Diagram tab on the left to obtain this presentation. You will see a graphical presentation of the lineage (data impact or data source) with rounded boxes representing nodes, many contained within larger boxes (container structures, e.g., a schema contains several tables). Connecting lines denote the lineage flow. In general, the lineage tools within Talend Data Catalog function identically whether one is analyzing data flow lineage, semantic lineage or both. However, the presentation is different, as follows:
Overview
You may click this icon to show or hide an Overview panel of the lineage trace diagram. Click in the overview to quickly move to a portion of the full diagram.
Zoom In/Out and Fit to content
Click the Zoom in or Zoom out icons to adjust the magnification of the diagram. You may also click the Fit to content icon to view the entire diagram at the best zoom that will fit.
Collapse / Expand
Click the Expand icon to expand the entire diagram (ensure that you do not have an object selected, otherwise the action will only apply to that object).
Click Collapse to collapse all objects to the highest level.
You may also click on the plus sign for an object to expand and the minus sign to collapse just that object.
Highlight
Click to select an object in the trace and then click the Highlight icon to highlight the path through that selected object.
You may double-click or perform a long click on the Highlight icon to lock it in place and the path will highlight for any subsequently selected object.
Focus the lineage trace
You may focus the lineage trace to only include that portion of the trace that passes through another object in the diagram. Click to select an object, click More Actions and click the Only show the selected node ancestors and descendants icon.
To remove the focus and return to the entire lineage diagram, simply click to close the dialog stating “Currently focusing on…”.
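Focusing on an object amounts to keeping only the links that lie on a path through it: links among the object and its ancestors, plus links among the object and its descendants. The sketch below shows one way to compute this. The function names are hypothetical and the tool's own implementation may differ.

```python
def reach(edges, start, reverse=False):
    """Objects reachable from start, following links forward or in reverse."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            here, there = (dst, src) if reverse else (src, dst)
            if here == node and there not in seen:
                seen.add(there)
                stack.append(there)
    return seen


def focus(edges, node):
    """Keep only the links on paths that pass through the given node."""
    upstream = reach(edges, node, reverse=True) | {node}
    downstream = reach(edges, node) | {node}
    return [
        (src, dst)
        for src, dst in edges
        if (src in upstream and dst in upstream)
        or (src in downstream and dst in downstream)
    ]


edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "C")]
print(focus(edges, "B"))  # [('A', 'B'), ('B', 'C')]
```

Note that the direct link from A to C is dropped even though both ends are related to B, because that link bypasses B.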
Open the object page
You may double-click the object, right-click it and select Open, or select the object and use the Open icon to navigate to the object page.
With the newer Diagram lineage, it is easier to simply expand the Details panel.
Trace lineage from another object
You may re-trace the lineage from any object in the diagram. Select the object and use the Trace Lineage icon to restart the trace from that point with that type of trace.
You may download a PNG or SVG image of the diagram.
Quick find
In the upper right, there is a search text box that provides a quick list of object names that contain the text you type. You may click on any of the results to select that object in the diagram and move the focus there.
Explore Further
Invoking a lineage trace from any reference to an object
You may invoke a lineage trace from any diagram or any list of results (e.g., from a Browse or Search) via the right-click context menu.
Interpreting the graphical lineage
In general, the lineage tools within Talend Data Catalog function identically whether one is analyzing data flow lineage, semantic lineage or both. However, the presentation is different, as follows:
In addition, Talend Data Catalog has four levels of presentation:
- Configuration Model Connections Overview – which is a diagram representing the various Models contained within a configuration and how they are related (or stitched) to each other based upon connection definitions manually assigned to Talend Data Catalog.
- Model Connections Overview – which is a diagram representing the various Models contained within the directory of an external repository and how they are related (or stitched) to each other based upon connection definitions already provided in the external metadata repository.
- Model Lineage Overview – which is a diagram representing an overview of the lineage within a given Model.
- Lineage Trace analysis at the configuration or Model level – which is a fully detailed trace of semantic and/or data flow lineage for detailed analysis.
Properties Panel
Click to select an object and view its properties in the Properties Panel on the right. You may show and hide this panel as needed.