# Transformations

## Overview

A transformation is a function that specifies logic to run against data retrieved from external data sources.

By default, transformations are inlined into Feature Views. The following example shows a Feature View that implements a transformation in the body of the Feature View function `my_feature_view`. The transformation runs in `spark_sql` mode and renames columns from the data source to `feature_one` and `feature_two`.
```python
@batch_feature_view(
    mode="spark_sql",
    # ...
)
def my_feature_view(input_data):
    return f"""
        SELECT
            entity_id,
            timestamp,
            column_a AS feature_one,
            column_b AS feature_two
        FROM {input_data}
        """
```
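Because a `spark_sql`-mode function body is ordinary Python that returns a SQL string, the rendered query can be sanity-checked without a Spark cluster. A minimal sketch, with the `@batch_feature_view` decorator omitted so the function can be called directly (`"generated_view_name"` stands in for the view name Tecton generates):

```python
def my_feature_view(input_data):
    # Returns the SQL that Tecton would run against the data source view.
    return f"""
    SELECT
        entity_id,
        timestamp,
        column_a AS feature_one,
        column_b AS feature_two
    FROM {input_data}
    """

sql = my_feature_view("generated_view_name")
assert "column_a AS feature_one" in sql
assert "FROM generated_view_name" in sql
```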
Alternatively, transformations can be defined outside of Feature Views as standalone Tecton objects created with the `@transformation` decorator. This allows transformations to be modular, discoverable in Tecton's Web UI, and reusable across multiple Feature Views.
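Setting the Tecton decorators aside, the reuse pattern can be illustrated in plain Python: one transformation function supplies the query logic for several feature definitions. All names below are illustrative, not Tecton APIs:

```python
def good_credit_sql(source):
    # Shared transformation body: the same SQL logic, parameterized by source.
    return (
        "SELECT user_id, IF(credit_score > 670, 1, 0) AS user_has_good_credit "
        f"FROM {source}"
    )

# Two feature definitions reusing the same transformation on different sources.
batch_sql = good_credit_sql("batch_credit_scores")
stream_sql = good_credit_sql("stream_credit_scores")
```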
## Transformation input and output

### Input

The input to a transformation contains the columns in the data source.

### Output

When a transformation is defined inside of a Feature View, the output of the transformation is a DataFrame that must include:

- The join keys of all entities included in the `entities` list.
- A timestamp column. If there is more than one timestamp column, a `timestamp_key` parameter must be set to specify which column is the correct timestamp of the feature values.
- Feature value columns. All columns other than the join keys and the timestamp will be considered features in a Feature View.
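These output requirements can be expressed as a small validation helper. The sketch below is illustrative only: it checks a plain list of column names rather than a real Spark DataFrame, and the function name is ours, not Tecton's:

```python
def check_feature_view_output(columns, join_keys, timestamp_key="timestamp"):
    """Verify join keys and timestamp exist, then return the feature columns.

    columns: output column names of the transformation
    join_keys: join keys of all entities in the Feature View's `entities` list
    """
    cols = set(columns)
    missing = set(join_keys) - cols
    if missing:
        raise ValueError(f"missing join key columns: {missing}")
    if timestamp_key not in cols:
        raise ValueError(f"missing timestamp column: {timestamp_key}")
    # Everything that is not a join key or the timestamp is a feature.
    return sorted(cols - set(join_keys) - {timestamp_key})

features = check_feature_view_output(
    ["user_id", "timestamp", "user_has_good_credit"], join_keys=["user_id"]
)
```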
## Modes

A transformation mode specifies the format in which a transformation needs to be written. For example, in `spark_sql` mode, a transformation needs to be written in SQL, while in `pyspark` mode, a transformation needs to be written using the PySpark DataFrame API.

This page describes the transformation modes that are supported by transformations defined inside and outside of Feature Views. The examples show transformations defined inside of Feature Views.
### Modes for Batch Feature Views and Stream Feature Views

#### `mode="spark_sql"` and `mode="snowflake_sql"`

| Characteristic | Description |
| --- | --- |
| Summary | Contains a SQL query |
| Supported Feature View types | Batch Feature View, Stream Feature View. `mode="snowflake_sql"` is not supported in Stream Feature Views. |
| Supported data platforms | Databricks, EMR, Snowflake |
| Input type | A string (the name of a view generated by Tecton) |
| Output type | A string |
##### Example

**Spark**

```python
@batch_feature_view(
    mode="spark_sql",
    # ...
)
def user_has_good_credit(credit_scores):
    return f"""
        SELECT
            user_id,
            IF (credit_score > 670, 1, 0) as user_has_good_credit,
            date as timestamp
        FROM {credit_scores}
        """
```

**Snowflake**

```python
@batch_feature_view(
    mode="snowflake_sql",
    # ...
)
def user_has_good_credit(credit_scores):
    return f"""
        SELECT
            user_id,
            IFF (credit_score > 670, 1, 0) as user_has_good_credit,
            date as timestamp
        FROM {credit_scores}
        """
```
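The two queries differ only in the name of the conditional function (Spark SQL's `IF` versus Snowflake's `IFF`); both compute the same 0/1 bucketing. That logic, with the threshold and column name taken from the examples above, can be sketched in plain Python:

```python
def user_has_good_credit_value(credit_score):
    # Equivalent of IF/IFF(credit_score > 670, 1, 0).
    return 1 if credit_score > 670 else 0

results = [user_has_good_credit_value(s) for s in (580, 671, 700)]
```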
#### `mode="pyspark"`

| Characteristic | Description |
| --- | --- |
| Summary | Contains Python code that is executed within a Spark context. |
| Supported Feature View types | Batch Feature View, Stream Feature View |
| Supported data platforms | Databricks, EMR |
| Input type | A Spark DataFrame or a Tecton constant |
| Output type | A Spark DataFrame |
| Notes | Third-party libraries can be included in user-defined PySpark functions if your cluster allows third-party libraries. |

##### Example

```python
@batch_feature_view(
    mode="pyspark",
    # ...
)
def user_has_good_credit(credit_scores):
    from pyspark.sql import functions as F

    df = credit_scores.withColumn(
        "user_has_good_credit",
        F.when(credit_scores["credit_score"] > 670, 1).otherwise(0),
    )
    return df.select("user_id", df["date"].alias("timestamp"), "user_has_good_credit")
```
#### `mode="snowpark"`

| Characteristic | Description |
| --- | --- |
| Summary | Contains Python code that is executed in Snowpark, using the Snowpark API for Python. |
| Supported Feature View types | Batch Feature View |
| Supported data platforms | Snowflake |
| Input type | A `snowflake.snowpark.DataFrame` or a Tecton constant |
| Output type | A `snowflake.snowpark.DataFrame` |
| Notes | The transformation function can call functions that are defined in Snowflake. |