Databricks Unity Catalog Data Sources
Prerequisites
- Tecton SDK 0.6+
- DBR 11+ (with the Premium plan or above)
Limitations
- Tecton is currently compatible with the `SINGLE USER` Databricks cluster access mode, but not yet with the `SHARED` access mode.
- In order for your Tecton notebook to be able to read directly from Unity Catalog data sources (e.g. to run `FeatureView.get_historical_features(from_source=True)`), you must create your notebook cluster with the `SINGLE USER` access mode. This means each Databricks user will need a separate notebook cluster.
Databricks & AWS Setup
- Assign your Databricks workspaces used by Tecton to the metastore that you plan to use.
- Add the Databricks Service Principal used by Tecton as users of the metastore.
- For the S3 bucket you configured as the Tecton offline store, make sure all AWS IAM requirements here are also met, and that the IAM role ARN is registered as a storage credential in Unity Catalog via the Databricks Data Explorer.
- Create an external location for this S3 bucket with the above storage credential and grant the Databricks account used by Tecton at least the `READ FILES` and `WRITE FILES` permissions. This can be done by running the following SQL commands in a notebook or the Databricks SQL editor, backed by a Unity-enabled cluster or SQL warehouse.

```sql
CREATE EXTERNAL LOCATION [IF NOT EXISTS] <location_name>
URL 's3://<bucket_path>'
WITH ([STORAGE] CREDENTIAL <storage_credential_name>)
[COMMENT <comment_string>];

GRANT READ FILES ON EXTERNAL LOCATION <location_name> TO <tecton_databricks_account>;
GRANT WRITE FILES ON EXTERNAL LOCATION <location_name> TO <tecton_databricks_account>;
```
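If you manage several environments, the CREATE/GRANT statements above can be templated rather than hand-edited. The sketch below generates them in Python; the location, bucket, credential, and principal names are hypothetical placeholders, not values from this document:

```python
# Minimal sketch: template the external-location SQL for one environment.
# All names passed in below are hypothetical examples -- substitute your own.
def external_location_sql(location: str, bucket: str, credential: str, principal: str) -> list[str]:
    """Return the CREATE EXTERNAL LOCATION and GRANT statements as strings."""
    return [
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location} "
        f"URL 's3://{bucket}' WITH (STORAGE CREDENTIAL {credential});",
        f"GRANT READ FILES ON EXTERNAL LOCATION {location} TO `{principal}`;",
        f"GRANT WRITE FILES ON EXTERNAL LOCATION {location} TO `{principal}`;",
    ]

for statement in external_location_sql(
    location="tecton_offline_store",
    bucket="my-tecton-offline-store-bucket",
    credential="tecton_offline_store_credential",
    principal="tecton-service-principal",
):
    print(statement)
```

The generated statements can then be run in a notebook or the SQL editor as described above.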
Configuring Tecton Data Sources & Feature Views to work with Unity
- Please let Tecton know that you plan to use Unity Catalog, so that we can appropriately configure the internal Spark clusters used by Tecton's SDK.
- No changes are needed for Feature Views that don't use a Unity data source.
- Please note that changing a Feature View's data source may result in re-materialization.
- Customers using SDK version 0.6 can use the existing data source config `HiveConfig` by setting the database and table params as follows:

```python
test_unity_batch_source = BatchSource(
    name="test_unity_batch_source",
    batch_config=HiveConfig(
        database="main.default",  # <catalog_name>.<schema_name>
        table="department",  # <table_name>
    ),
)
```
- For Feature Views that depend on a Unity data source, materialization jobs must run on DBR 11.3+ using the `SINGLE USER` cluster access mode. Pin the Spark version to `11.3.x-scala2.12` and set `data_security_mode` to `SINGLE_USER` via the `batch_compute` param in the Feature View declaration. This can be configured via `DatabricksJsonClusterConfig` as shown here:

```python
json_config = """
{
    "new_cluster": {
        "num_workers": 0,
        "spark_version": "11.3.x-scala2.12",
        "data_security_mode": "SINGLE_USER",
        "node_type_id": "m5.large",
        "aws_attributes": {
            "ebs_volume_type": "GENERAL_PURPOSE_SSD",
            "ebs_volume_count": 1,
            "ebs_volume_size": 100,
            "first_on_demand": 0,
            "spot_bid_price_percent": 100,
            "instance_profile_arn": "arn:aws:iam::your_account_id:instance-profile/your-role",
            "availability": "SPOT",
            "zone_id": "auto"
        },
        "spark_conf": {
            "spark.databricks.service.server.enabled": "true",
            "spark.hadoop.fs.s3a.acl.default": "BucketOwnerFullControl",
            "spark.sql.sources.partitionOverwriteMode": "dynamic",
            "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "CORRECTED",
            "spark.sql.legacy.parquet.int96RebaseModeInRead": "CORRECTED",
            "spark.sql.legacy.parquet.int96RebaseModeInWrite": "CORRECTED",
            "spark.master": "local[*]"
        }
    }
}
"""
```

The example Feature View is then configured as follows:

```python
@batch_feature_view(
    sources=[test_unity_batch_source],
    mode="spark_sql",
    entities=[entity],
    online=False,
    offline=True,
    batch_compute=DatabricksJsonClusterConfig(
        json=json_config
    ),  # only required if you're using HiveConfig to register your Unity data source
    feature_start_time=datetime(2023, 5, 1),
    batch_schedule=timedelta(days=1),
    ttl=timedelta(days=30),
    description="Test Unity FV",
)
def feature_view():
    return ...
```
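Because `DatabricksJsonClusterConfig` takes a raw JSON string, a malformed config (for example, a trailing comma) or a wrong access mode typically only surfaces when a materialization job is launched. The standard-library sketch below (not a Tecton API; the helper name is our own) checks the string up front; the required values mirror the DBR 11.3+ and `SINGLE_USER` requirements described above:

```python
import json

# Minimal sketch: sanity-check a cluster-config string before passing it
# to DatabricksJsonClusterConfig. The function name is hypothetical.
def validate_unity_cluster_config(config_str: str) -> dict:
    cluster = json.loads(config_str)["new_cluster"]  # raises on malformed JSON
    # Unity data sources require the SINGLE USER access mode...
    assert cluster["data_security_mode"] == "SINGLE_USER"
    # ...and a pinned DBR 11.3+ Spark version.
    assert cluster["spark_version"].startswith("11.3")
    return cluster

cluster = validate_unity_cluster_config(
    '{"new_cluster": {"spark_version": "11.3.x-scala2.12", '
    '"data_security_mode": "SINGLE_USER", "num_workers": 0}}'
)
print(cluster["spark_version"])  # 11.3.x-scala2.12
```

Running this check in CI alongside `tecton plan` catches config typos before any Databricks job is submitted.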