
Create and Manage Datasets

info

Available for Tecton on Databricks or EMR. Coming to Tecton on Snowflake in a future release.

If you are interested in this functionality, please file a feature request.

Tecton Datasets provide a convenient way to save feature data for model training, experiment reproducibility, and analysis. Datasets are versioned alongside your Feature Store configuration, allowing you to inspect and restore the state of all features as of the time the Dataset was created. Tecton Datasets can be created in two ways:

  1. Saving training DataFrames that are requested from Feature Services
  2. Logging online requests to a Feature Service

Saved Training DataFrames

Tracking feature training DataFrames as Tecton Datasets has a number of advantages:

  1. Datasets are tracked and catalogued in one central place.
  2. Datasets are identified with a single string, which you can store alongside other model parameters in model metadata stores such as MLflow (see the sketch after this list).
  3. When you save a Dataset, Tecton stores both the data and the metadata associated with the features (e.g., data sources, transformation logic), allowing you to track the full lineage of a Dataset.
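For example, a minimal sketch of recording a Dataset name in a model metadata store, assuming an MLflow tracking setup (the parameter names here are illustrative, not a Tecton API):

import mlflow

# Record the Tecton Dataset name alongside other model parameters so the
# training data for this run can be traced back later
with mlflow.start_run():
    mlflow.log_param("tecton_dataset", "my_training_data")
    mlflow.log_param("learning_rate", 0.01)  # illustrative model parameter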

To save a Dataset for later retrieval, use the save parameter of get_features_for_events. If this parameter is supplied, Tecton eagerly computes the DataFrame and stores it for later retrieval alongside the metadata used to generate the Dataset. To give the Dataset a name, use the save_as parameter of get_features_for_events. Saving Datasets does not incur any Tecton costs, though there may be additional S3 costs due to writes and storage.

Defining a Saved Training DataFrame Dataset

The code example below creates a Dataset by providing the save_as argument to the FeatureService or FeatureView method get_features_for_events.

import pandas as pd
import tecton

# Events DataFrame containing the join keys and timestamps for each training row
events = pd.DataFrame([...], columns=[...])

ws = tecton.get_workspace("prod")
my_fs = ws.get_feature_service("ctr_prediction_service")

# Compute the features for these events and save them as a named Dataset
my_fs.get_features_for_events(events, save_as="my_training_data")
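get_features_for_events also returns the computed features. A minimal sketch, assuming the events DataFrame and Feature Service from the example above, of saving the Dataset and using the returned features in one step:

# Save the Dataset and use the returned feature DataFrame immediately
training_df = my_fs.get_features_for_events(events, save_as="my_training_data")
training_df.to_pandas().head()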

When the save_as or save parameter is provided to get_features_for_events, Tecton automatically stores the Dataset's metadata alongside the feature DataFrame for later retrieval.

After a Dataset is defined, it is available in the Web UI.

Logged Online Requests

Feature Services can continuously log online requests and feature vector responses as Tecton Datasets. These logged feature Datasets can be used for auditing, analysis, and training dataset generation.

To enable feature logging on a Feature Service, add a LoggingConfig as shown in the example below, optionally specifying a sample rate. Then run tecton apply to apply your changes.

from tecton import FeatureService, LoggingConfig

ctr_prediction_service = FeatureService(
    name="ctr_prediction_service",
    description="A Feature Service used for supporting a CTR prediction model.",
    features=[ad_ground_truth_ctr_performance_7_days, user_total_ad_frequency_counts],
    # Log 50% of online requests and responses to a Tecton Dataset
    logging=LoggingConfig(sample_rate=0.5),
)

Within 60 seconds, this will create a new Tecton Dataset under the Datasets tab in the Web UI. New feature logs will continue to be appended to this Dataset every 30 minutes. If the features in the Feature Service change, a new Dataset version will be created. Datasets are named with the following convention: <Feature Service name>.logged_requests.<Version>. The Dataset with the highest version number for a Feature Service is the latest active Dataset.

(Screenshot: Logged Features in the Web UI)
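Given this naming convention, a logged requests Dataset can be fetched like any other Dataset. A minimal sketch, assuming version 1 of the example service above exists:

import tecton

ws = tecton.get_workspace("prod")
# Dataset name follows <Feature Service name>.logged_requests.<Version>
logged_requests = ws.get_dataset("ctr_prediction_service.logged_requests.1")
logged_requests.to_pandas().head()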

Interacting with Datasets

Datasets can be fetched by name using the code snippet below:

import tecton

ws = tecton.get_workspace("prod")
my_training_data = ws.get_dataset("my_training_data")

# View my_training_data as a Pandas DataFrame
my_training_data.to_pandas().head()
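If you are unsure of a Dataset's exact name, the Workspace can also list the Datasets it contains. A sketch, assuming the list_datasets method is available in your SDK version:

# Print the names of all Datasets registered in the workspace
for name in ws.list_datasets():
    print(name)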

Fetch a Dataset's Events DataFrame

All Tecton Datasets contain a reference to their "events DataFrame", which contains the join keys and request data used to generate feature vectors.

If the Dataset was saved during training data generation, then this DataFrame was passed to Tecton during the call to get_features_for_events(...).

In the case of a logged requests Dataset, this DataFrame is the accumulated list of online requests to the Feature Service.

To fetch a Dataset's events DataFrame, run the following code in a notebook:

import tecton

ws = tecton.get_workspace("prod")
dataset_events = ws.get_dataset("my_training_data").get_spine_dataframe()
dataset_events.to_pandas().head()

This events DataFrame can be used as input to reproduce a Dataset from scratch or to test out new features.
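For example, a minimal sketch of regenerating a Dataset against the current feature definitions, reusing the saved events as input (the new Dataset name is illustrative):

import tecton

ws = tecton.get_workspace("prod")
my_fs = ws.get_feature_service("ctr_prediction_service")

# Reuse the original events (join keys and request data) to regenerate features
events = ws.get_dataset("my_training_data").get_spine_dataframe().to_pandas()
my_fs.get_features_for_events(events, save_as="my_training_data_v2")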

Deleting Datasets

You can delete a Dataset using the Workspace.delete_dataset() method. The underlying data is cleaned up from S3, and the Dataset record will no longer appear in lookups. Note that for logged Datasets, feature logging must be disabled on the Feature Service before the Dataset can be deleted.

import tecton

ws = tecton.get_workspace("prod")
ws.delete_dataset("my_training_data")
