Skip to main content
Version: Next

SQL Queries

Incubating

Important Capabilities

CapabilityStatusNotes
Column-level LineageParsed from SQL queries
Table-Level LineageParsed from SQL queries

This source reads a newline-delimited JSON file containing SQL queries and parses them to generate lineage.

Query File Format

This file should contain one JSON object per line, with the following fields:

  • query: string - The SQL query to parse.
  • timestamp (optional): number - The timestamp of the query, in seconds since the epoch.
  • user (optional): string - The user who ran the query. This user value will be directly converted into a DataHub user urn.
  • operation_type (optional): string - Platform-specific operation type, used if the operation type can't be parsed.
  • downstream_tables (optional): string[] - Fallback list of tables that the query writes to, used if the query can't be parsed.
  • upstream_tables (optional): string[] - Fallback list of tables the query reads from, used if the query can't be parsed.

Example Queries File

{"query": "SELECT x FROM my_table", "timestamp": 1689232738.051, "user": "user_a", "downstream_tables": [], "upstream_tables": ["my_database.my_schema.my_table"]}
{"query": "INSERT INTO my_table VALUES (1, 'a')", "timestamp": 1689232737.669, "user": "user_b", "downstream_tables": ["my_database.my_schema.my_table"], "upstream_tables": []}

Note that this file does not represent a single JSON object, but instead newline-delimited JSON, in which each line is a separate JSON object.

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

datahub_api:  # Only necessary if using a non-DataHub sink, e.g. the file sink
server: http://localhost:8080
timeout_sec: 60
source:
type: sql-queries
config:
platform: "snowflake"
default_db: "SNOWFLAKE"
query_file: "./queries.json"

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
platform 
string
The platform for which to generate data, e.g. snowflake
query_file 
string
Path to file to ingest
default_db
string
The default database to use for unqualified table names
default_dialect
string
The SQL dialect to use when parsing queries. Overrides automatic dialect detection.
default_schema
string
The default schema to use for unqualified table names
platform_instance
string
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details.
env
string
The environment that all assets produced by this connector belong to
Default: PROD
usage
BaseUsageConfig
The usage config to use when generating usage statistics
Default: {'bucket_duration': 'DAY', 'end_time': '2025-01-10...
usage.bucket_duration
Enum
Size of the time window to aggregate usage stats.
Default: DAY
usage.end_time
string(date-time)
Latest date of lineage/usage to consider. Default: Current time in UTC
usage.format_sql_queries
boolean
Whether to format sql queries
Default: False
usage.include_operational_stats
boolean
Whether to display operational stats.
Default: True
usage.include_read_operational_stats
boolean
Whether to report read operational stats. Experimental.
Default: False
usage.include_top_n_queries
boolean
Whether to ingest the top_n_queries.
Default: True
usage.start_time
string(date-time)
Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.
usage.top_n_queries
integer
Number of top queries to save to each table.
Default: 10
usage.user_email_pattern
AllowDenyPattern
regex patterns for user emails to filter in usage.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
usage.user_email_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
usage.user_email_pattern.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
usage.user_email_pattern.allow.string
string
usage.user_email_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
usage.user_email_pattern.deny.string
string

Code Coordinates

  • Class Name: datahub.ingestion.source.sql_queries.SqlQueriesSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for SQL Queries, feel free to ping us on our Slack.