Create a Batch Data Source
This guide shows you how to create a Tecton BatchSource.
You must register a data source with Tecton before you define features based on that data. To register a data source, follow these steps:
- Define a data source object.
- Apply your data source to Tecton using the Tecton CLI.
- Verify the data source by querying it in a notebook.
This guide assumes you've already set up the permissions required for Tecton to read from the source.
In the first example, we'll use a Hive table for batch data, but the same principles apply to any raw data source, including streams. See the Data Sources overview or the Data Sources API for more details on other data sources.
Example of Defining a Batch Data Source Object
In this example, we define a BatchSource that contains the configuration necessary for Tecton to access our Hive user table.
Create a new file in your feature repository, and paste in the following code:
from tecton import HiveConfig, BatchSource
fraud_users_batch = BatchSource(
name="users_batch",
batch_config=HiveConfig(database="fraud", table="fraud_users"),
)
In the example definition above, name is the unique identifier Tecton uses to refer to the data source. You can also add optional metadata parameters for organization, such as description, owner, and tags.
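For illustration, here's a minimal sketch of the same definition with optional metadata added; the description, owner, and tag values below are hypothetical:
from tecton import HiveConfig, BatchSource
fraud_users_batch = BatchSource(
    name="users_batch",
    batch_config=HiveConfig(database="fraud", table="fraud_users"),
    description="User records for fraud detection",  # hypothetical description
    owner="data-eng@example.com",  # hypothetical owner
    tags={"team": "fraud"},  # hypothetical tags
)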
Applying the Data Source
So far, all we've done is write code in our local feature repository. To use the data source, we need to apply our new definition to Tecton. We can do this with the Tecton CLI:
$ tecton apply
Using workspace "prod"
✅ Imported 15 Python modules from the feature repository
✅ Collecting local feature declarations
✅ Performing server-side validation of feature declarations
↓↓↓↓↓↓↓↓↓↓↓↓ Plan Start ↓↓↓↓↓↓↓↓↓↓↓↓
+ Create BatchDataSource
name: users_batch
↑↑↑↑↑↑↑↑↑↑↑↑ Plan End ↑↑↑↑↑↑↑↑↑↑↑↑
Are you sure you want to apply this plan? [y/N]>
Enter y to apply this definition to Tecton.
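If you want to preview the changes without applying them, you can run tecton plan first; it prints the same plan output and exits without modifying the workspace.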
Verify the Data Source
To verify that the data sources are connected properly, use the Tecton SDK in a notebook environment:
import tecton
# Fetch the data source from the workspace you applied it to.
users_batch = tecton.get_workspace('my_workspace').get_data_source('users_batch')
# Read a sample of the underlying data to confirm the connection works.
print(users_batch.get_dataframe().to_pandas().head(10))
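For a large source, you can limit the preview to a time window instead of reading everything. A minimal sketch, assuming the source has a timestamp column configured; the dates are hypothetical:
from datetime import datetime
# Restrict the preview to a single day of records.
df = users_batch.get_dataframe(start_time=datetime(2023, 1, 1), end_time=datetime(2023, 1, 2))
print(df.to_pandas().head(10))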
With a Data Source defined and verified, you are now ready to define Tecton Feature Views that make use of this data. You can also configure your Batch Data Source following the instructions below.
Configuring a BatchSource
In the example above, we used a HiveConfig for the BatchSource. Tecton supports several other configuration classes for different sources of data. To use one:
- Declare a configuration object that is an instance of a configuration class specific to your source. Tecton supports these configuration classes:
  - FileConfig: File source (such as a file on S3)
  - HiveConfig: Hive (or Glue) table
  - UnityConfig: Unity table
  - RedshiftConfig: Redshift table or query
  - SnowflakeConfig: Snowflake table or query
  - SparkBatchConfig: Custom function to create a Spark DataFrame
Note:
- Tecton on Snowflake only supports SnowflakeConfig.
- Please contact Tecton to enable Unity Catalog on your deployment before using UnityConfig.
The complete list of configurations can be found in the API Reference.
As an alternative to using a configuration object, you can use a Data Source Function, which offers more flexibility; a sketch appears after these steps.
- Declare a BatchSource object that references the configuration defined in the previous step:
  - name: A unique identifier for the batch source. For example, "click_event_log".
  - batch_config: The configuration created in the step above.
The batch_config object definition may optionally specify a timestamp column representing the time of each record. Values in the timestamp column must be in one of the following formats:
- A native TimestampType object.
- A string representing a timestamp that can be parsed by the default Spark SQL format, yyyy-MM-dd'T'hh:mm:ss.SSS'Z'.
- A custom timestamp string, for which you provide a timestamp_format to parse it. The format must follow Spark's datetime pattern guidelines. See the FileConfig sketch after these steps for an example.
A timestamp column must be specified in the batch_config object if any BatchFeatureView uses a FilteredSource with a BatchSource that uses the batch_config object.
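To illustrate both steps together, here is a minimal sketch using FileConfig with a custom timestamp format; the S3 path, column name, and format string are hypothetical:
from tecton import BatchSource, FileConfig
click_event_log = BatchSource(
    name="click_event_log",
    batch_config=FileConfig(
        uri="s3://example-bucket/click_events/",  # hypothetical path
        file_format="parquet",
        timestamp_field="event_ts",  # hypothetical column name
        timestamp_format="yyyy-MM-dd HH:mm:ss",  # Spark datetime pattern
    ),
)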
See the Data Source API reference for detailed descriptions of Data Source attributes.
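If none of the built-in configuration classes fit your source, a Data Source Function gives you full control over how the batch DataFrame is built. A minimal sketch, assuming a Parquet path on S3 (the path and names are hypothetical):
from tecton import BatchSource, spark_batch_config
# The decorated function receives a SparkSession and returns a Spark DataFrame.
@spark_batch_config()
def click_events_function(spark):
    return spark.read.parquet("s3://example-bucket/click_events/")  # hypothetical path
click_events_ds = BatchSource(
    name="click_events_function_ds",
    batch_config=click_events_function,
)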
Example
The following example declares a BatchSource object that contains a configuration for connecting to Snowflake.
Spark:
from tecton import BatchSource, SnowflakeConfig

click_stream_snowflake_config = SnowflakeConfig(
    url="https://[your-cluster].eu-west-1.snowflakecomputing.com/",
    database="YOUR_DB",
    schema="CLICK_STREAM_SCHEMA",
    warehouse="COMPUTE_WH",
    table="CLICK_STREAM",
)
clickstream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_config,
)
Snowflake:
from tecton import BatchSource, SnowflakeConfig

click_stream_snowflake_config = SnowflakeConfig(
    database="YOUR_DB",
    schema="CLICK_STREAM_SCHEMA",
    table="CLICK_STREAM",
)
clickstream_snowflake_ds = BatchSource(
    name="click_stream_snowflake_ds",
    batch_config=click_stream_snowflake_config,
)
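Note the difference between the two variants: on Spark, the SnowflakeConfig must specify the url and warehouse used to connect, while the Tecton on Snowflake variant omits them because queries run on the Snowflake connection already configured for your deployment.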