Skip to main content
Version: Next

File Based Lineage

Certified

Important Capabilities

CapabilityStatusNotes
Column-level LineageSpecified in the lineage file.
Table-Level LineageSpecified in the lineage file.

This plugin pulls lineage metadata from a yaml-formatted file. An example of one such file is located in the examples directory here.

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
type: datahub-lineage-file
config:
# Coordinates
file: /path/to/file_lineage.yml
# Whether we want to query datahub-gms for upstream data
preserve_upstream: False

sink:
# sink configs

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
file 
string
File path or URL to lineage file to ingest.
preserve_upstream
boolean
Whether we want to query datahub-gms for upstream data. False means it will hard replace upstream data for a given entity. True means it will query the backend for existing upstreams and include it in the ingestion run
Default: True

Lineage File Format

The lineage source file should be a .yml file with the following top-level keys:

version: the version of lineage file config the config conforms to. Currently, the only version released is 1.

lineage: the top level key of the lineage file containing a list of EntityNodeConfig objects

EntityNodeConfig:

  • entity: EntityConfig object
  • upstream: (optional) list of child EntityNodeConfig objects
  • fineGrainedLineages: (optional) list of FineGrainedLineageConfig objects

EntityConfig:

  • name: identifier of the entity. Typically name or guid, as used in constructing entity urn.
  • type: type of the entity (only dataset is supported as of now)
  • env: the environment of this entity. Should match the values in the table here
  • platform: a valid platform like kafka, snowflake, etc..
  • platform_instance: optional string specifying the platform instance of this entity

For example if dataset URN is urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,DEV) then EntityConfig will look like:

name : userdb.public.customer_table
type: dataset
env: DEV
platform: redshift

FineGrainedLineageConfig:

  • upstreamType: type of upstream entity in a fine-grained lineage; default = "FIELD_SET"
  • upstreams: (optional) list of upstream schema field urns
  • downstreamType: type of downstream entity in a fine-grained lineage; default = "FIELD_SET"
  • downstreams: (optional) list of downstream schema field urns
  • transformOperation: (optional) transform operation applied to the upstream entities to produce the downstream field(s)
  • confidenceScore: (optional) the confidence in this lineage between 0 (low confidence) and 1 (high confidence); default = 1.0

FineGrainedLineageConfig can be used to display fine grained lineage, also referred to as column-level lineage, for custom sources.

You can also view an example lineage file checked in here

Code Coordinates

  • Class Name: datahub.ingestion.source.metadata.lineage.LineageFileSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for File Based Lineage, feel free to ping us on our Slack.