Lineage Trace Header Options
Lineage Flow Type
The Type selector in the upper left of the lineage display offers the following choices:
- DATA FLOW - Based upon connection definitions to data stores and the physical transformation rules which transform and move the data.
- SEMANTIC FLOW - Based upon the definition and usage type relationships from a term, concept or logical model to a physical representation.
- OVERVIEW - Based upon a view of the design-level lineage limited to the scope of the model on which you invoked it (by clicking the Lineage tab). It is therefore not a complete end-to-end lineage picture, only an overview of the lineage for that model.
Both data flow and semantic flow may be present in a diagram.
Lineage Direction
Generally, lineage is represented as a “flow”: either a flow of data, as part of a data movement and possibly transformation process, or a flow of “meaning”, as from a defining object like a glossary term to a defined object like a column. These directions are also commonly associated with analysis of the lineage, hence:
- Data Flow lineage
  - Forward or destination or target or impact lineage of the data movement and transformation processes. Represented as being to the right of the point of origin.
  - Reverse or source lineage of the data movement and transformation processes. Represented as being to the left of the point of origin.
- Semantic Flow lineage
  - Forward or target or usage or defined lineage of the application of meaning or documentation or inheritance. Represented as being below (and many times to the right of) the point of origin.
  - Reverse or source or origin or definition lineage of the application of meaning or documentation or inheritance. Represented as being above (and many times to the left of) the point of origin.
Direction only makes sense for a lineage trace, not a lineage overview.
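As a conceptual illustration only, the two data flow directions can be thought of as walking the same set of links in opposite directions. The sketch below is not the product's implementation or API; the object names, the EDGES list and the trace function are hypothetical.

```python
from collections import defaultdict

# Hypothetical data flow links (source, target); not taken from a real model.
EDGES = [
    ("Staging.Customer", "DW.Customer"),
    ("DW.Customer", "Mart.CustomerDim"),
    ("Mart.CustomerDim", "Report.Sales"),
]


def trace(edges, origin, direction):
    """Return every object reachable from origin in the given direction.

    "forward" follows links source -> target (impact lineage);
    "reverse" follows links target -> source (source lineage).
    """
    if direction not in ("forward", "reverse"):
        raise ValueError("direction must be 'forward' or 'reverse'")

    adjacency = defaultdict(list)
    for source, target in edges:
        if direction == "forward":
            adjacency[source].append(target)
        else:
            adjacency[target].append(source)

    seen, stack = set(), [origin]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


impact = trace(EDGES, "DW.Customer", "forward")
sources = trace(EDGES, "DW.Customer", "reverse")
```

With DW.Customer as the point of origin, impact holds the objects that would be drawn to the right of the origin and sources holds those that would be drawn to the left.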
Control Flow
Control flow is lineage that traces from an object used as part of a selection WHERE clause or similar structure that impacts what data is moved but is not itself directly moved to the target. There are two types of control flow:
- Column control flow, where the control flow directly impacts the values of columns (e.g., lookups).
- Row control flow, where the control flow does not directly impact the values of columns (e.g., filters).
It is easy to imagine a common scenario where you trace data impact and your impact trace affects a commonly used (in terms of joins and WHERE clauses) dimension, e.g., the time dimension in the warehouse, mart or otherwise. Just about every report will be using that dimension in some way, and thus the impact lineage is basically everything. In this case the diagram size quickly grows out of the capability of your browser to present the lineage let alone navigate and analyze it.
For this and other similar reasons, the same menu as above includes options to limit the lineage.
Control Lineage Option | Description | Delay in Presentation |
None | No control flow data impacts are traced | None |
Limited | Show only immediate (adjacent) control flow objects | Maybe slow |
Complete | All control flow impacts are traced | Likely slow |
Steps
- Begin a lineage trace.
- In Control Flow, you may:
  - Click None to hide any objects that are connected only via control flow and to hide all control flow links.
  - Click Limited to show any objects which are directly connected to the origin object via control flow and show those control flow links.
  - Click Complete to show any objects which are connected via control flow to the origin object and any subsequent objects and show those control flow links.
- If Limited control flow display is enabled, go to the lineage diagram and click a target element; the control flow that the target depends upon then appears.
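The difference between the three Control Flow options can be sketched as a source trace that treats control flow links differently depending on the option. This is a simplified illustration under stated assumptions, not how the product computes lineage: the link list, the object names and the upstream function are hypothetical, and Limited is modeled as "show control flow objects adjacent to the origin without tracing further from them".

```python
# Hypothetical links (source, target, kind); kind is "data" or "control".
LINKS = [
    ("Source.CRM", "Staging.Customer", "data"),
    ("Filter.ActiveFlag", "Staging.Customer", "control"),  # row control, e.g. a filter
    ("Staging.Customer", "DW.Customer", "data"),
    ("Lookup.Region", "DW.Customer", "control"),  # column control, e.g. a lookup
    ("Source.Geo", "Lookup.Region", "data"),
]


def upstream(links, origin, control_flow="none"):
    """Return the objects shown in a source trace from origin.

    control_flow is "none", "limited" or "complete".
    """
    if control_flow not in ("none", "limited", "complete"):
        raise ValueError("control_flow must be 'none', 'limited' or 'complete'")

    shown, expanded, frontier = set(), set(), [origin]
    while frontier:
        node = frontier.pop()
        if node in expanded:
            continue
        expanded.add(node)
        for source, target, kind in links:
            if target != node:
                continue
            if kind == "control":
                if control_flow == "none":
                    continue  # control flow is not traced at all
                if control_flow == "limited":
                    if node == origin:
                        shown.add(source)  # shown, but not traced further
                    continue
            shown.add(source)
            frontier.append(source)
    shown.discard(origin)
    return shown
```

In this sketch, tracing from DW.Customer with "none" yields only the data flow sources, "limited" adds Lookup.Region, and "complete" also adds Filter.ActiveFlag and Source.Geo, which is why the Complete option can produce much larger diagrams.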
Example
Search for the OnPrem DW.dbo.Customer table and open it.
Go to the Lineage tab and ensure that the Type is DATA FLOW and the View is DIAGRAM.
A red “pin” in the diagram marks the point of origin from which lineage is presented, in this case the Customer table.
Finally, ensure that the Control Flow is NONE:
Click Columns HIDE and select the top checkbox to show all the columns in the Customer table.
Then, expand the Staging DW model to the column table level using the minus sign.
At this time, the diagram does not contain any control lineage artifacts, as we specified.
Now, specify Control Flow as Limited:
And expand the Staging DW model again to the column level.
Many new objects that are not directly connected by data flow links now appear. Selecting Data Flow Settings > Control Flow > Limited shows any control flow related objects which are directly connected to the origin object via control flow.
And we see control lineage as different (dashed) lines.
One must click on an object to see the control lineage.
Click the Dimensional DW.dbo.Customer.ID column.
And we see control lineage source columns in gray shading.
Now, specify Control Flow as COMPLETE, expand the Accounting model to the column level, and click the Dimensional DW.dbo.Customer.ID column.
Even more objects are now shown in the lineage diagram and we also have gray shading when a column that is impacted by control lineage is selected.
Include Columns
In order to manage massive lineage graphs with tens of thousands of objects and edges (lines), you may specify whether to include or exclude the columns as part of the graph.
Selecting No could reduce the complexity of the lineage graph and thus the computation and download time on the backend (application server) side, as well as the diagram rendering time (in the UI).
Steps
- Begin a lineage trace.
- In Columns, you may:
  - Click No to exclude all column level lineage.
  - Click Yes to include all column level lineage.
Selecting No after having rendered a diagram with Yes specified causes the lineage to be recalculated by the backend, downloaded again, and then rendered by the UI all over again.
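To see why excluding columns can shrink a graph, consider collapsing column level links into table level links. The following is a minimal sketch with hypothetical column names and helper functions; it only illustrates the reduction in the number of links and does not reflect the product's internals.

```python
# Hypothetical column level links (source column, target column).
COLUMN_LINKS = [
    ("Staging.Customer.ID", "DW.Customer.ID"),
    ("Staging.Customer.FirstName", "DW.Customer.Name"),
    ("Staging.Customer.LastName", "DW.Customer.Name"),
    ("DW.Customer.ID", "Mart.CustomerDim.Key"),
]


def table_of(column):
    """Drop the last dotted part: 'DW.Customer.ID' -> 'DW.Customer'."""
    return column.rsplit(".", 1)[0]


def collapse_to_tables(column_links):
    """Merge column level links into distinct table level links."""
    return {(table_of(source), table_of(target)) for source, target in column_links}


table_links = collapse_to_tables(COLUMN_LINKS)
```

Here four column level links collapse into two table level links; on graphs with tens of thousands of objects the same effect applies at a much larger scale.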
Example
Lineage Depth
Pick the Depth in the pull-down in the upper right.
- 1 (Adjacent) step in the lineage. Objects in the lineage that are the next items in a lineage trace.
For impact, adjacent can often be the data store (like a warehouse) that is the target of an object being loaded by DI/ETL that is the focus of the lineage. For source lineage, it can often mean the data source directly loaded from to produce the object that is the focus of the lineage.
- 2 through 9 steps in the lineage.
- Any type for both data impact and lineage.
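Conceptually, the Depth setting behaves like a limit on the number of steps a trace may take from the point of origin. The sketch below illustrates that idea with a breadth-first walk; the link list, object names and trace_to_depth function are hypothetical, and depth=None is used here as a stand-in for the Any choice.

```python
from collections import deque

# Hypothetical data flow links (source, target).
DEPTH_LINKS = [
    ("Staging.Customer", "DW.Customer"),
    ("DW.Customer", "Mart.CustomerDim"),
    ("Mart.CustomerDim", "Report.Sales"),
]


def trace_to_depth(links, origin, depth=None):
    """Map each reachable object to its number of steps from origin.

    depth=1 returns only adjacent objects; depth=None applies no limit.
    """
    adjacency = {}
    for source, target in links:
        adjacency.setdefault(source, []).append(target)

    distance = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if depth is not None and distance[node] >= depth:
            continue  # do not step beyond the requested depth
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    del distance[origin]
    return distance
```

Calling trace_to_depth(DEPTH_LINKS, "Staging.Customer", 1) returns only the adjacent object, while omitting the depth returns every object in the chain.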
Lineage Display As
Lineage Filter
The ability to present a manageable amount of information targeted for analysis is a critical concern with lineage diagrams. In particular, for larger diagrams (such as those originating in a central fact in a warehouse, or a commonly used time-based dimension), filtering is crucial if you want to avoid spaghetti diagrams, memory faults (remember, the actual lineage diagram must be presented by your local browser within its memory limitations), or simply huge wait times for the diagram to appear.
Several filters are available, each with its own choices.
Next to each filter option is a number showing how many objects would be filtered out of the diagram if that filter is enabled.
- SHOW INTERNAL OBJECTS to show any intermediate schemas/tables/columns between connections in the lineage, such as transformations in an ETL pipeline.
- SHOW EXTERNAL OBJECTS to show any external source tables or files from which an object in the lineage is derived, such as data lake files from which Hive derives tables.
- SHOW TEMPORARY OBJECTS to show intermediate temporary tables/columns in the lineage, such as temporary data store objects which are created and then deleted as part of the data movement process.
- SHOW EXTERNAL TABLE LOCATION OBJECTS to include objects which are only external table locations that require connection resolution.
- EXCLUDE MODEL TYPES to not show specific types of models in the lineage
- EXCLUDE MODELS to not show specifically selected models.
Please see the discussion on handling large diagrams.
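The filter choices above can be pictured as predicates applied to the objects in a lineage graph, with a count of how many objects each SHOW filter controls. The sketch below is illustrative only: the LineageObject class, the kind labels and both functions are hypothetical, and only three of the SHOW filters plus EXCLUDE MODELS are modeled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageObject:
    name: str
    model: str
    kind: str = "regular"  # "regular", "internal", "external" or "temporary"


# Hypothetical objects in a lineage graph.
OBJECTS = [
    LineageObject("Customer", "Staging DW"),
    LineageObject("tmp_customer_load", "Staging DW", "temporary"),
    LineageObject("Join step", "ETL Job", "internal"),
    LineageObject("customers.csv", "Data Lake", "external"),
    LineageObject("CustomerDim", "Dimensional DW"),
]


def apply_filters(objects, show=frozenset(), exclude_models=frozenset()):
    """Keep regular objects plus any kinds listed in show, minus excluded models."""
    return [
        obj
        for obj in objects
        if obj.model not in exclude_models
        and (obj.kind == "regular" or obj.kind in show)
    ]


def filtered_out_counts(objects):
    """Count, per kind, the objects hidden when the matching SHOW filter is off."""
    return Counter(obj.kind for obj in objects if obj.kind != "regular")


default_view = [obj.name for obj in apply_filters(OBJECTS)]
```

With no SHOW filters enabled, default_view contains only the two regular tables, and filtered_out_counts(OBJECTS) reports one hidden object for each of the temporary, internal and external kinds.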
Lineage Filter Options
One may include or filter out various object types in order to focus only on specific types of objects in the lineage.
Click Edit Filters and specify:
- SHOW TEMPORARY OBJECTS to show intermediate temporary tables/columns in the lineage
- SHOW INTERNAL OBJECTS to show any intermediate schemas/tables/columns between connections in the lineage
- SHOW EXTERNAL OBJECTS to show any external source tables or files from which an object in the lineage is derived
- SHOW EXTERNAL TABLE LOCATION OBJECTS to include objects which are only external table locations that require connection resolution.
- EXCLUDE MODEL TYPES to not show specific types of models in the lineage
- EXCLUDE MODELS to not show specifically selected models.
In some cases you may see that a lineage diagram is taking an excessive amount of time to display or that you are presented with the message:
This large diagram has xxxxx objects and xxxxx links which may require more resources than what your browser can handle.
You may use the PROCEED ANYWAY button to try to visualize the diagram.
You may also save these settings as the defaults for future lineage traces.
Steps
- Begin a lineage trace.
- Click Edit Filters and specify:
- SHOW TEMPORARY OBJECTS to show intermediate temporary tables/columns in the lineage
- SHOW INTERNAL OBJECTS to show any intermediate schemas/tables/columns between connections in the lineage
- SHOW EXTERNAL OBJECTS to show any external source tables or files from which an object in the lineage is derived
- SHOW EXTERNAL TABLE LOCATION OBJECTS to include objects which are only external table locations that require connection resolution.
- EXCLUDE MODEL TYPES to not show specific types of models in the lineage
- EXCLUDE MODELS to not show specifically selected models.
Show Internal/External Objects
Lineage reporting may:
- Either Show Internal Objects within a model (e.g., interim steps in transformations) or show just the objects stitched to other model objects.
- Either Show External Objects that are not directly material to the lineage trace (such as the link from files in HDFS to the tables representing them in Hive) or hide these objects.
Show Temporary Objects
Big data solutions and other ETL/DI processes routinely use temporary files and tables. When harvesting, Talend Data Catalog detects temporary files and marks them as TEMPORARY in their lineage characteristics. This means that you can distinguish temporary objects from permanent/stitchable ones in a lineage diagram and optionally hide or show them.
Show External Table Location Objects
Models may refer to external tables that require connection resolution. By default, these table location objects are not shown. You may use this option to explicitly show them.
Default View
This option allows you to save the current filter setting to be the default for future trace reports.
Lineage Type (Display As)
Use the Lineage Trace Header Options to specify the lineage presentation.
Saving Lineage Results
You may save a lineage graph to be shared and referred to later. This reduces the time required to read from the database and regenerate a lineage graph for larger diagrams.