Remote Dataset Generation
This capability is under development and is not yet available. The details on this page are subject to change before launch. Remote Dataset Generation will first launch in Private Preview to a subset of users for testing before moving to Public Preview, and it will be available for both the Rift and Spark compute engines.
Background
By default, Tecton's Offline Retrieval Methods construct and execute a point-in-time correct offline feature query in the local environment. Tecton leverages either the local Spark context or, in the case of Python-only offline feature retrieval, a local query engine that is included in the `tecton[rift]` pip package.
See this page for more information on choosing the local compute engine for offline feature retrieval.
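For reference, local offline retrieval today looks roughly like the following. This is a minimal sketch; the workspace name, feature service name, and events DataFrame contents are hypothetical.

```python
import pandas as pd
import tecton

# Hypothetical workspace and feature service names.
ws = tecton.get_workspace("prod")
fs = ws.get_feature_service("fraud_detection_service")

# Training events: one row per prediction event, with join keys and timestamps.
events = pd.DataFrame(
    {
        "user_id": ["user_1", "user_2"],
        "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    }
)

# Executes the point-in-time correct query locally, using either the local
# Spark context or Rift's bundled query engine depending on your setup.
df = fs.get_features_for_events(events).to_pandas()
```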
Remote Dataset Generation
Remote Dataset Generation aims to improve upon Tecton's offline feature retrieval experience with the following benefits:
- Scalable: Tecton will manage cluster(s) to execute offline retrieval. If needed, these can be configured in the same way that materialization clusters can be configured today (with Rift and Spark). Behind the scenes, Tecton will also automatically optimize offline retrieval to make efficient use of cluster resources (e.g. by splitting up the retrieval query or the input training events DataFrame).
- Easy to Use: Offline feature data outputs will be written to S3 and will be accessible via a Tecton Dataset -- these are automatically cataloged within your Tecton workspace and Web UI and can be shared by multiple users in your organization (see the sketch after this list).
- Standalone: Using this capability will only require the `tecton` Python package. It has no dependency on a local Spark context or local credentials to connect to Data Sources or the data plane.
- Secure: This will conform to Tecton's Access Controls -- only users with an appropriate role will be able to retrieve offline features using this capability.
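To illustrate the cataloging point above, a saved Dataset is expected to be loadable by name from the workspace, roughly as follows. This is a sketch; the workspace name is hypothetical, and the dataset name reuses the example from the Usage section below.

```python
import tecton

ws = tecton.get_workspace("my_workspace")

# Datasets created by Remote Dataset Generation are cataloged in the workspace,
# so other users can load them by name without re-running retrieval.
dataset = ws.get_dataset("my_feature_service.dataset01")
df = dataset.to_dataframe().to_pandas()
```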
This capability will be available for both `get_features_for_events` (for Feature Views, Feature Tables, & Feature Services) and `get_features_in_range` (for Feature Views & Feature Tables).
In a future iteration, Remote Dataset Generation will support appending to or overwriting an existing Tecton Dataset. This will be useful for adding new data to an existing Dataset, e.g. for automated periodic model retraining.
Usage
```python
# Start the DatasetJob
job = my_feature_service.get_features_for_events(events).start_dataset_job()
# -------- #
# Starting DatasetJob for my_feature_service (Job ID b241fb27...) in workspace my_workspace
# Creating Dataset "my_feature_service.dataset01"
# View DatasetJob here: https://my_deployment.tecton.ai/app/jobs
# -------- #

# Block until Tecton has completed offline feature retrieval
job.wait_for_completion()

# Retrieve the Tecton Dataset
dataset = job.get_dataset()

# Retrieve the underlying DataFrame
df = dataset.to_dataframe().to_pandas()
```
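The same job-based flow should apply to `get_features_in_range`. A minimal sketch, assuming a hypothetical Feature View named `user_transaction_counts` and an arbitrary time range:

```python
from datetime import datetime

import tecton

ws = tecton.get_workspace("my_workspace")
fv = ws.get_feature_view("user_transaction_counts")  # hypothetical Feature View name

# Start a remote job that writes all feature values in the time range
# to a Tecton Dataset, then wait for it and load the result.
job = fv.get_features_in_range(
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 2, 1),
).start_dataset_job()

job.wait_for_completion()
df = job.get_dataset().to_dataframe().to_pandas()
```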