Business Glossary
This plugin pulls business glossary metadata from a yaml-formatted file. An example of one such file is located in the examples directory here.
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: datahub-business-glossary
  config:
    # Coordinates
    file: /path/to/business_glossary_yaml
    enable_auto_id: true # recommended to set to true so datahub will auto-generate guids from your term names
# sink configs if needed
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description | 
|---|---|
| file ✅ One of string, string(path) | File path or URL to business glossary file to ingest. | 
| enable_auto_id boolean | Generate guid urns instead of a plaintext path urn with the node/term's hierarchy. Default: False | 
The JSONSchema for this configuration is inlined below.
{
  "title": "BusinessGlossarySourceConfig",
  "type": "object",
  "properties": {
    "file": {
      "title": "File",
      "description": "File path or URL to business glossary file to ingest.",
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "string",
          "format": "path"
        }
      ]
    },
    "enable_auto_id": {
      "title": "Enable Auto Id",
      "description": "Generate guid urns instead of a plaintext path urn with the node/term's hierarchy.",
      "default": false,
      "type": "boolean"
    }
  },
  "required": [
    "file"
  ],
  "additionalProperties": false
}
Business Glossary File Format
The business glossary source file should be a .yml file with the following top-level keys:
Glossary: the top level keys of the business glossary file
Example Glossary:
version: "1"                                                # the version of business glossary file config the config conforms to. Currently the only version released is `1`.
source: DataHub                                         # the source format of the terms. Currently only supports `DataHub`
owners:                                                 # owners contains two nested fields
  users:                                                # (optional) a list of user IDs
    - njones
  groups:                                               # (optional) a list of group IDs
    - logistics
url: "https://github.com/datahub-project/datahub/"      # (optional) external url pointing to where the glossary is defined externally, if applicable
nodes:                                                  # list of child **GlossaryNode** objects. See **GlossaryNode** section below
    ...
GlossaryNode: a container of GlossaryNode and GlossaryTerm objects
Example GlossaryNode:
- name: Shipping                                                # name of the node
  description: Provides terms related to the shipping domain    # description of the node
  owners:                                                       # (optional) owners contains 2 nested fields
    users:                                                      # (optional) a list of user IDs
      - njones
    groups:                                                     # (optional) a  list of group IDs
      - logistics
  nodes:                                                        # list of child **GlossaryNode** objects
    ...
  knowledge_links:                                              # (optional) list of **KnowledgeCard** objects
    - label: Wiki link for shipping
      url: "https://en.wikipedia.org/wiki/Freight_transport"
GlossaryTerm: a term in your business glossary
Example GlossaryTerm:
- name: FullAddress                                                          # name of the term
  description: A collection of information to give the location of a building or plot of land.    # description of the term
  owners:                                                                   # (optional) owners contains 2 nested fields
    users:                                                                  # (optional) a list of user IDs
      - njones
    groups:                                                                 # (optional) a  list of group IDs
      - logistics
  term_source: "EXTERNAL"                                                   # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization.
  source_ref: FIBO                                                          # (optional) if external, what is the name of the source the glossary term is coming from?
  source_url: "https://www.google.com"                                      # (optional) if external, what is the url of the source definition?
  inherits:                                                                 # (optional) list of **GlossaryTerm** that this term inherits from
    -  Privacy.PII
  contains:                                                                 # (optional) a list of **GlossaryTerm** that this term contains
    - Shipping.ZipCode
    - Shipping.CountryCode
    - Shipping.StreetAddress
  custom_properties:                                                        # (optional) a map of key/value pairs of arbitrary custom properties
    - is_used_for_compliance_tracking: "true"
  knowledge_links:                                                          # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary node's page
    - url: "https://en.wikipedia.org/wiki/Address"
      label: Wiki link
  domain: "urn:li:domain:Logistics"                                            # (optional) domain name or domain urn
To see how these all work together, check out this comprehensive example business glossary file below:
Example business glossary file
version: "1"
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: Sensitive
        description: Sensitive Data
        custom_properties:
          is_confidential: "false"
      - name: Confidential
        description: Confidential Data
        custom_properties:
          is_confidential: "true"
      - name: HighlyConfidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: "true"
        domain: Marketing
  - name: PersonalInformation
    description: All terms related to personal information
    owners:
      users:
        - mjames
    terms:
      - name: Email
        ## An example of using an id to pin a term to a specific guid
        ## See "how to generate custom IDs for your terms" section below
        # id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"
        description: An individual's email address
        inherits:
          - Classification.Confidential
        owners:
          groups:
            - Trust and Safety
      - name: Address
        description: A physical address
      - name: Gender
        description: The gender identity of the individual
        inherits:
          - Classification.Sensitive
  - name: Shipping
    description: Provides terms related to the shipping domain
    owners:
      users:
        - njones
      groups:
        - logistics
    terms:
      - name: FullAddress
        description: A collection of information to give the location of a building or plot of land.
        owners:
          users:
            - njones
          groups:
            - logistics
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://www.google.com"
        inherits:
          - Privacy.PII
        contains:
          - Shipping.ZipCode
          - Shipping.CountryCode
          - Shipping.StreetAddress
        related_terms:
          - Housing.Kitchen.Cutlery
        custom_properties:
          - is_used_for_compliance_tracking: "true"
        knowledge_links:
          - url: "https://en.wikipedia.org/wiki/Address"
            label: Wiki link
        domain: "urn:li:domain:Logistics"
    knowledge_links:
      - label: Wiki link for shipping
        url: "https://en.wikipedia.org/wiki/Freight_transport"
  - name: ClientsAndAccounts
    description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
    owners:
      groups:
        - finance
    terms:
      - name: Account
        description: Container for records associated with a business arrangement for regular transactions and services
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
        inherits:
          - Classification.HighlyConfidential
        contains:
          - ClientsAndAccounts.Balance
      - name: Balance
        description: Amount of money available or owed
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
  - name: Housing
    description: Provides terms related to the housing domain
    owners:
      users:
        - mjames
      groups:
        - interior
    nodes:
      - name: Colors
        description: "Colors that are used in Housing construction"
        terms:
          - name: Red
            description: "red color"
            term_source: "EXTERNAL"
            source_ref: FIBO
            source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
          - name: Green
            description: "green color"
            term_source: "EXTERNAL"
            source_ref: FIBO
            source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
          - name: Pink
            description: pink color
            term_source: "EXTERNAL"
            source_ref: FIBO
            source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
    terms:
      - name: WindowColor
        description: Supported window colors
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
        values:
          - Housing.Colors.Red
          - Housing.Colors.Pink
      - name: Kitchen
        description: a room or area where food is prepared and cooked.
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
      - name: Spoon
        description: an implement consisting of a small, shallow oval or round bowl on a long handle, used for eating, stirring, and serving food.
        term_source: "EXTERNAL"
        source_ref: FIBO
        source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
        related_terms:
          - Housing.Kitchen
        knowledge_links:
          - url: "https://en.wikipedia.org/wiki/Spoon"
            label: Wiki link
Source file linked here.
Generating custom IDs for your terms
IDs are normally inferred from the glossary term/node's name, see the enable_auto_id config. But, if you need a stable
identifier, you can generate a custom ID for your term. It should be unique across the entire Glossary.
Here's an example ID:
id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"
A note of caution: once you select a custom ID, it cannot be easily changed.
Compatibility
Compatible with version 1 of business glossary format. The source will be evolved as we publish newer versions of this format.
Code Coordinates
- Class Name: datahub.ingestion.source.metadata.business_glossary.BusinessGlossaryFileSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Business Glossary, feel free to ping us on our Slack.