Exploring Databricks Metadata with Spark Catalog and Unity Catalog
This tip is part of the Decembricks 2024 series, which teaches a new tip every day of December.
Summary
- Use `spark.catalog.list*` methods to programmatically explore catalogs, databases, tables, and columns
- Recursively traverse the metadata hierarchy to build a catalog structure dictionary
- Compare catalog structures to identify differences between environments
Overview
As your lakehouse platform grows to encompass more use cases, teams, and workloads, keeping track of all the metadata - catalogs, databases, tables, and columns - can become increasingly challenging. Fortunately, the Spark Catalog APIs provide programmatic methods to explore and analyze this metadata.
In this post, we’ll walk through an example of how to use these APIs in PySpark to:
- Recursively list all databases and tables within a catalog
- Build a dictionary representing the hierarchical catalog structure
- Compare catalog structures across environments to identify metadata differences
Exploring Metadata Programmatically
The key to programmatic metadata exploration in Databricks is the family of `spark.catalog.list*` methods:
- `spark.catalog.listCatalogs()` - List all available catalogs
- `spark.catalog.listDatabases()` - List databases in the current catalog
- `spark.catalog.listTables()` - List tables in a specified database
- `spark.catalog.listColumns()` - List columns in a specified table
The same information is also available through the Unity Catalog information schema tables if you prefer to use SQL.
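For example, here is a rough SQL equivalent of the table listing, assuming a Unity Catalog-enabled workspace where the built-in `system.information_schema.tables` view is accessible:

```python
# List every table you have privileges to see, across all catalogs,
# via the Unity Catalog information schema.
tables_df = spark.sql("""
    SELECT table_catalog, table_schema, table_name
    FROM system.information_schema.tables
    ORDER BY table_catalog, table_schema, table_name
""")
tables_df.show(truncate=False)
```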
Let’s see a quick example of using these methods to explore all databases and tables in each catalog:
```python
for catalog in spark.catalog.listCatalogs():
    print("Catalog is ", catalog.name)
    spark.catalog.setCurrentCatalog(catalog.name)
    for database in spark.catalog.listDatabases():
        print("Database is:", database.name)
        for table in spark.catalog.listTables(database.name):
            print("Table is:", table.name)
```
This will output something like:
```
Catalog is hive_metastore
Database is: default
Table is: customers
Table is: orders
Database is: example
Table is: flights
Catalog is unity_dev_environment
Database is: dlt_pipeline_metrics
Table is: pipeline_status
```
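To drill one level further, `spark.catalog.listColumns` returns column-level metadata for a table. A small sketch, reusing the `default.customers` table from the sample output above:

```python
# Each entry exposes the column's name, dataType, nullability, and
# partitioning flags.
for column in spark.catalog.listColumns("default.customers"):
    print("Column is:", column.name, "-", column.dataType)
```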
Building a Catalog Structure
While the above snippet is useful for exploration, let’s see how we can build a more programmatic representation of our catalog structure that will enable richer analysis and comparison.
```python
def get_catalog_structure():
    catalog_structure = {}
    all_paths = []

    for catalog in spark.catalog.listCatalogs():
        catalog_structure[catalog.name] = {}
        spark.catalog.setCurrentCatalog(catalog.name)
        for database in spark.catalog.listDatabases():
            catalog_structure[catalog.name][database.name] = []
            tables = spark.catalog.listTables(database.name)
            for table in tables:
                catalog_structure[catalog.name][database.name].append(table.name)
                all_paths.append(f"{catalog.name}.{database.name}.{table.name}")

    return catalog_structure, all_paths
```
This walks each catalog and its databases, building a dictionary `catalog_structure` where:
- Keys are catalog names
- Values are dictionaries where:
  - Keys are database names
  - Values are lists of the tables in that database

It also collects a list `all_paths` containing the fully qualified path of each table in `catalog.database.table` notation.
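For illustration, given the sample output from earlier, the two return values would look roughly like this:

```python
catalog_structure = {
    "hive_metastore": {
        "default": ["customers", "orders"],
        "example": ["flights"],
    },
    "unity_dev_environment": {
        "dlt_pipeline_metrics": ["pipeline_status"],
    },
}

all_paths = [
    "hive_metastore.default.customers",
    "hive_metastore.default.orders",
    "hive_metastore.example.flights",
    "unity_dev_environment.dlt_pipeline_metrics.pipeline_status",
]
```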
The actual implementation wraps each `spark.catalog` call in a try/except block to gracefully handle missing privileges and other exceptions, such as references to tables whose underlying data no longer exists at the registered location.
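A minimal sketch of that pattern, assuming you simply want to skip anything that cannot be listed (the exact exception handling is not shown above, so catching `AnalysisException` is an assumption here):

```python
from pyspark.sql.utils import AnalysisException

def safe_list_tables(database_name):
    """List tables, skipping databases we lack privileges on or whose
    tables point at data that no longer exists."""
    try:
        return spark.catalog.listTables(database_name)
    except AnalysisException as e:
        print(f"Skipping {database_name}: {e}")
        return []
```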
Comparing Catalog Structures
Now that we can capture a catalog structure, let’s put it to use by comparing structures across environments to identify metadata diffs.
```python
def compute_catalog_diffs(catalog_structure):
    # Get list of all catalogs
    catalogs = list(catalog_structure.keys())
    diffs = {}

    # Compare each catalog with every other catalog
    for i in range(len(catalogs)):
        for j in range(i + 1, len(catalogs)):
            cat1, cat2 = catalogs[i], catalogs[j]

            # Get all paths for catalog 1
            cat1_paths = set()
            for db in catalog_structure[cat1]:
                for table in catalog_structure[cat1][db]:
                    cat1_paths.add(f"{db}.{table}")

            # Get all paths for catalog 2
            cat2_paths = set()
            for db in catalog_structure[cat2]:
                for table in catalog_structure[cat2][db]:
                    cat2_paths.add(f"{db}.{table}")

            # Compute differences
            diffs[f"{cat1} vs {cat2}"] = {
                'only_in_' + cat1: cat1_paths - cat2_paths,
                'only_in_' + cat2: cat2_paths - cat1_paths,
                'common': cat1_paths & cat2_paths
            }

    return diffs
```
This compares every unique pair of catalogs, building path sets for each and computing three sets per pair:
1. `only_in_X` - paths only in catalog X
2. `only_in_Y` - paths only in catalog Y
3. `common` - paths in both
For large numbers of catalogs, you may want to selectively compare certain pairs rather than combinatorially comparing all pairs.
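One lightweight way to do that is to filter the captured structure down to the catalogs you care about before diffing. A sketch, where the two catalog names are hypothetical:

```python
# Compare only a chosen pair of catalogs instead of every combination.
# The catalog names below are placeholders for your own environments.
structure, _ = get_catalog_structure()
selected = {
    name: structure[name]
    for name in ("unity_dev_environment", "unity_prod_environment")
    if name in structure
}
differences = compute_catalog_diffs(selected)
```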
Usage
```python
structure, all_paths = get_catalog_structure()
differences = compute_catalog_diffs(structure)
```