Version: Next

Cassandra

Incubating

Important Capabilities

Capability | Status / Notes
Asset Containers | Enabled by default
Detect Deleted Entities | Optionally enabled via stateful_ingestion.remove_stale_metadata
Platform Instance | Enabled by default
Schema Metadata | Enabled by default

This plugin extracts the following:

  • Metadata for tables
  • Column types associated with each table column
  • The keyspace each table belongs to

Setup

This integration pulls metadata directly from Cassandra databases, including both DataStax Astra DB and Cassandra Enterprise Edition (EE).

You’ll need to have a Cassandra instance or an Astra DB setup with appropriate access permissions.

Steps to Get the Required Information

  1. Set Up User Credentials:

    • For Astra DB:
      • Log in to your Astra DB Console.
      • Navigate to Organization Settings > Token Management.
      • Generate an Application Token with the required permissions for read access.
      • Download the Secure Connect Bundle from the Astra DB Console.
    • For Cassandra EE:
      • Ensure you have a username and password with read access to the necessary keyspaces.
  2. Permissions:

    • The user or token must have SELECT permissions that allow it to:
      • Access metadata in system keyspaces (e.g., system_schema) to retrieve information about keyspaces, tables, columns, and views.
      • Perform SELECT operations on the data tables if data profiling is enabled.
  3. Verify Database Access:

    • For Astra DB: Ensure the Secure Connect Bundle is used and configured correctly.
    • For self-managed Cassandra (open source or EE): Ensure the contact point and port are reachable from the machine running ingestion.
caution

When enabling profiling, make sure to set a limit on the number of rows to sample. Profiling large tables without a limit may lead to excessive resource consumption and slow performance.
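
As a rough illustration of the caution above, a recipe can cap the number of sampled rows with profiling.limit. This is only a sketch: the field names come from the Config Details table, while the value 10000 is a placeholder assumption, not a recommended setting.

```yaml
source:
  type: "cassandra"
  config:
    contact_point: "localhost"
    port: 9042
    profiling:
      enabled: true
      # Assumed cap on the number of rows sampled; tune it for your cluster.
      limit: 10000
      # Table-level only: skips the column-level statistics.
      profile_table_level_only: true

sink:
  # config sinks
```

Starting with a small limit and raising it once run times are known is usually safer than profiling without one.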

note

Connecting to Astra DB requires the path to the Secure Connect Bundle to be set in the configuration, so use CLI-based ingestion to load metadata into DataHub.
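
A minimal sketch of the Astra DB variant of the recipe might look like the following. The bundle path is a hypothetical placeholder, and the token line assumes that the application token has been exported as an environment variable named ASTRA_DB_TOKEN (a name chosen here for illustration) and that the recipe relies on ${...} environment variable substitution.

```yaml
source:
  type: "cassandra"
  config:
    cloud_config:
      # Hypothetical local path to the bundle downloaded from the Astra DB Console.
      secure_connect_bundle: "/path/to/secure-connect-bundle.zip"
      # Assumes the application token is available as the env var ASTRA_DB_TOKEN.
      token: "${ASTRA_DB_TOKEN}"

sink:
  # config sinks
```

Keeping the token out of the recipe file avoids committing credentials alongside the configuration.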

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "cassandra"
  config:
    # Credentials for on-prem Cassandra
    contact_point: "localhost"
    port: 9042
    username: "admin"
    password: "password"

    # Or
    # Credentials for Astra Cloud
    #cloud_config:
    #  secure_connect_bundle: "Path to Secure Connect Bundle (.zip)"
    #  token: "Application Token"

    # Optional: allow / deny extraction of particular keyspaces.
    keyspace_pattern:
      allow: [".*"]

    # Optional: allow / deny extraction of particular tables.
    table_pattern:
      allow: [".*"]

    # Optional
    profiling:
      enabled: true
      profile_table_level_only: true

sink:
  # config sinks

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
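
To make the dotted notation concrete, the sketch below shows how keyspace_pattern.allow, keyspace_pattern.deny, and keyspace_pattern.ignoreCase would be written as nested YAML keys. The deny regex is a hypothetical example, not a default.

```yaml
source:
  type: "cassandra"
  config:
    keyspace_pattern:        # keyspace_pattern
      allow:                 # keyspace_pattern.allow
        - ".*"
      deny:                  # keyspace_pattern.deny
        - "^staging_.*"      # hypothetical keyspace prefix to skip
      ignoreCase: true       # keyspace_pattern.ignoreCase
```

Each dot in a field name corresponds to one additional level of indentation under config.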

Each entry below lists the field name, its type, a description, and the default value where one exists.
contact_point
string
Domain or IP address of the Cassandra instance (excluding port).
Default: localhost
password
string
Password credential associated with the specified username.
platform_instance
string
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details.
port
integer
Port number to connect to the Cassandra instance.
Default: 9042
username
string
Username credential with read access to the system_schema keyspace.
env
string
The environment that all assets produced by this connector belong to.
Default: PROD
cloud_config
CassandraCloudConfig
Configuration for cloud-based Cassandra, such as DataStax Astra DB.
cloud_config.secure_connect_bundle 
string
File path to the Secure Connect Bundle (.zip) used for a secure connection to DataStax Astra DB.
cloud_config.token 
string
The Astra DB application token used for authentication.
cloud_config.connect_timeout
integer
Timeout in seconds for establishing new connections to Cassandra.
Default: 600
cloud_config.request_timeout
integer
Timeout in seconds for individual Cassandra requests.
Default: 600
keyspace_pattern
AllowDenyPattern
Regex patterns to filter keyspaces for ingestion.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
keyspace_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
keyspace_pattern.allow
array
List of regex patterns to include in ingestion.
Default: ['.*']
keyspace_pattern.allow.string
string
keyspace_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
keyspace_pattern.deny.string
string
profile_pattern
AllowDenyPattern
Regex patterns for tables to profile.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
profile_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
profile_pattern.allow
array
List of regex patterns to include in ingestion.
Default: ['.*']
profile_pattern.allow.string
string
profile_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
profile_pattern.deny.string
string
table_pattern
AllowDenyPattern
Regex patterns to filter keyspaces.tables for ingestion.
Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}
table_pattern.ignoreCase
boolean
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern.allow
array
List of regex patterns to include in ingestion.
Default: ['.*']
table_pattern.allow.string
string
table_pattern.deny
array
List of regex patterns to exclude from ingestion.
Default: []
table_pattern.deny.string
string
profiling
GEProfilingBaseConfig
Configuration for profiling.
Default: {'enabled': False, 'operation_config': {'lower_fre...
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
integer
Max number of documents to profile. By default, profiles all documents.
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable.
Default: 20
profiling.offset
integer
Offset in documents to profile. By default, uses no offset.
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.operation_config
OperationConfig
Experimental feature. To specify operation configs.
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to run profiling at a lower frequency. This does not do any scheduling; it only adds checks that determine when profiling should be skipped.
Default: False
profiling.operation_config.profile_date_of_month
integer
Number between 1 and 31 (both inclusive) for the date of the month. If not specified, defaults to Nothing and this field does not take effect.
profiling.operation_config.profile_day_of_week
integer
Number between 0 and 6 (both inclusive) for the day of the week. 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect.
stateful_ingestion
StatefulStaleMetadataRemovalConfig
Configuration for stateful ingestion and stale metadata removal.
stateful_ingestion.enabled
boolean
Whether to enable stateful ingestion. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False.
Default: False
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True
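
As a sketch of how the stateful_ingestion fields fit together with the pipeline_name and datahub-rest sink mentioned in the description above, a recipe that soft-deletes stale entities might look like this. The pipeline name and server URL are hypothetical placeholders.

```yaml
# Hypothetical name; keep it the same across runs so the state of the
# last successful run can be compared with the current one.
pipeline_name: "cassandra_ingestion_example"

source:
  type: "cassandra"
  config:
    contact_point: "localhost"
    port: 9042
    username: "admin"
    password: "password"
    stateful_ingestion:
      enabled: true
      # Soft-delete entities seen in the last successful run but missing now.
      remove_stale_metadata: true

sink:
  type: "datahub-rest"
  config:
    # Assumed address of a locally running DataHub instance.
    server: "http://localhost:8080"
```

With this in place, a table dropped from Cassandra between two runs would be expected to be soft-deleted in DataHub rather than left behind as stale metadata.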

Code Coordinates

  • Class Name: datahub.ingestion.source.cassandra.cassandra.CassandraSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Cassandra, feel free to ping us on our Slack.