# Test Batch Features

## Import libraries and select your workspace
```python
import tecton
import pandas
from datetime import datetime, timedelta

ws = tecton.get_workspace("prod")
```
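If you're not sure which workspaces are available on your cluster, the SDK can list them. A quick check (this assumes you are already logged in to your Tecton cluster):

```python
# List all workspaces on the connected cluster, then pick the one you need.
print(tecton.list_workspaces())
```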
## Load a Batch Feature View
```python
fv = ws.get_feature_view("user_transaction_counts")
fv.summary()
```
## Run a Feature View transformation pipeline
The `BatchFeatureView::run` function can be used to dry-run a Feature View transformation pipeline over a given time range. This is useful for checking the output of your feature transformation logic or for debugging a materialization job.
There is no guarantee that the output data matches the feature values that would be created for this time frame, for example:

- When using incremental backfills, feature data for a given time range may depend on multiple executions of the Feature View transformation pipeline.
- Feature values may depend on scheduling information (e.g. `batch_schedule`, `data_delay`, `feature_start_time`) that doesn't match the `start_time` and `end_time` you provide.
- Aggregations may require more input data than the window you provide with `start_time` and `end_time`.

If you want to produce feature values for a given time range, use `get_historical_features(start_time, end_time)`.
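For example, a minimal sketch of such a call (the same method is shown with real output later on this page):

```python
# Read actual feature values for a one-day range, rather than dry-running
# the transformation pipeline.
feature_df = fv.get_historical_features(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
).to_pandas()
```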
```python
result_dataframe = fv.run(start_time=datetime(2021, 1, 1), end_time=datetime(2021, 1, 2)).to_pandas()
display(result_dataframe)
```
|   | user_id           | signup_timestamp    | credit_card_issuer |
|---|-------------------|---------------------|--------------------|
| 0 | user_600003278485 | 2021-01-01 06:25:57 | other              |
| 1 | user_469998441571 | 2021-01-01 07:16:06 | Visa               |
| 2 | user_502567604689 | 2021-01-01 04:39:10 | Visa               |
| 3 | user_930691958107 | 2021-01-01 10:52:31 | Visa               |
| 4 | user_782510788708 | 2021-01-01 20:15:25 | other              |
## Run with mock sources

Mock input data sources can be passed into the `BatchFeatureView::run` function using the same source names from the Feature View definition.
```python
users_data = pandas.DataFrame(
    {
        "user_id": ["user_1", "user_1", "user_2"],
        "cc_num": ["423456789012", "567890123456", "678901234567"],
        "signup_timestamp": [
            datetime(2022, 1, 1, 2),
            datetime(2022, 1, 1, 4),
            datetime(2022, 1, 1, 3),
        ],
    }
)

result_dataframe = fv.run(
    start_time=datetime(2022, 1, 1),
    end_time=datetime(2022, 1, 2),
    users=users_data,  # `users` is the name of this Feature View input.
).to_pandas()
display(result_dataframe)
```
|   | user_id | signup_timestamp    | credit_card_issuer |
|---|---------|---------------------|--------------------|
| 0 | user_1  | 2022-01-01 02:00:00 | Visa               |
| 1 | user_1  | 2022-01-01 04:00:00 | MasterCard         |
| 2 | user_2  | 2022-01-01 03:00:00 | Discover           |
## Run a Batch Feature View with tiled aggregations

`BatchFeatureView::run` for Feature Views with aggregations is quite similar to the above, with the only difference being that it also supports an `aggregation_level` parameter.

When a Feature View has tiled aggregations, the query operates in three logical steps:

1. The Feature View query is run over the provided time range. The user-defined transformations are applied over the data source.
2. The result of step 1 is aggregated into tiles the size of the `aggregation_interval`.
3. The tiles from step 2 are combined to form the final feature values. The number of tiles that are combined is based on the `time_window` of the aggregation.

To see the output of step 1, use `aggregation_level="disabled"`. For step 2, use `aggregation_level="partial"`. For step 3, use `aggregation_level="full"`, which is the default behavior.

For more details on tiled aggregations, refer to Creating Features that use Time-Windowed Aggregations.
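As a rough reference, a Feature View like `user_transaction_counts` might be defined as follows. This is a hypothetical sketch only: the source (`transactions_batch`), entity (`user`), and the decorator parameters are assumptions chosen to be consistent with the outputs below, not the actual definition in the `prod` workspace.

```python
from datetime import datetime, timedelta

from tecton import Aggregation, batch_feature_view

# Hypothetical sketch: `transactions_batch` (a batch data source) and `user`
# (an Entity) are assumed to be defined elsewhere in the feature repository.
@batch_feature_view(
    sources=[transactions_batch],
    entities=[user],
    mode="spark_sql",
    aggregation_interval=timedelta(days=1),  # the tile size seen at aggregation_level="partial"
    aggregations=[
        Aggregation(column="transaction", function="count", time_window=timedelta(days=1)),
        Aggregation(column="transaction", function="count", time_window=timedelta(days=30)),
        Aggregation(column="transaction", function="count", time_window=timedelta(days=90)),
    ],
    feature_start_time=datetime(2022, 1, 1),
)
def user_transaction_counts(transactions):
    return f"""
        SELECT user_id, 1 AS transaction, timestamp
        FROM {transactions}
    """
```

The examples below run this Feature View once at each `aggregation_level`.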
```python
agg_fv = ws.get_feature_view("user_transaction_counts")

result_dataframe = agg_fv.run(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
    aggregation_level="disabled",
).to_pandas()
display(result_dataframe)
```
|   | user_id           | transaction | timestamp           |
|---|-------------------|-------------|---------------------|
| 0 | user_222506789984 | 1           | 2022-05-01 21:04:38 |
| 1 | user_26990816968  | 1           | 2022-05-01 19:45:14 |
| 2 | user_337750317412 | 1           | 2022-05-01 15:18:48 |
| 3 | user_337750317412 | 1           | 2022-05-01 07:11:31 |
| 4 | user_337750317412 | 1           | 2022-05-01 01:50:51 |
```python
result_dataframe = agg_fv.run(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
    aggregation_level="partial",
).to_pandas()
display(result_dataframe)
```
|   | user_id           | transaction_count_1d | tile_start_time     | tile_end_time       |
|---|-------------------|----------------------|---------------------|---------------------|
| 0 | user_222506789984 | 1                    | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
| 1 | user_26990816968  | 1                    | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
| 2 | user_337750317412 | 4                    | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
| 3 | user_402539845901 | 2                    | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
| 4 | user_461615966685 | 1                    | 2022-05-01 00:00:00 | 2022-05-02 00:00:00 |
```python
end = datetime(2022, 5, 2)

# Note: to get an interesting "full" aggregation, we need to provide adequate input data.
result_dataframe = agg_fv.run(
    start_time=end - timedelta(days=90),
    end_time=end,
    aggregation_level="full",
).to_pandas()
display(result_dataframe)
```
|   | user_id           | timestamp           | transaction_count_1d_1d | transaction_count_30d_1d | transaction_count_90d_1d |
|---|-------------------|---------------------|-------------------------|--------------------------|--------------------------|
| 0 | user_131340471060 | 2022-04-30 00:00:00 | 1                       | 6                        | 22                       |
| 1 | user_131340471060 | 2022-04-23 00:00:00 | 1                       | 6                        | 21                       |
| 2 | user_131340471060 | 2022-04-18 00:00:00 | 1                       | 7                        | 20                       |
| 3 | user_131340471060 | 2022-04-15 00:00:00 | 2                       | 7                        | 19                       |
| 4 | user_131340471060 | 2022-04-08 00:00:00 | 1                       | 6                        | 17                       |
## Get a Range of Feature Values from the Offline Store

`BatchFeatureView::get_historical_features` can read a range of feature values from the offline store between a given `start_time` and `end_time`.

`from_source=True` can be passed in to bypass the offline store and compute features on-the-fly against the raw data source. This is useful for testing the expected output of feature values.

Use `from_source=False` (the default) to see what data is materialized in the offline store.
```python
result_dataframe = fv.get_historical_features(
    start_time=datetime(2022, 5, 1), end_time=datetime(2022, 5, 2)
).to_pandas()
display(result_dataframe)
```
|   | user_id           | timestamp           | transaction_count_1d_1d | transaction_count_30d_1d | transaction_count_90d_1d | _effective_timestamp |
|---|-------------------|---------------------|-------------------------|--------------------------|--------------------------|----------------------|
| 0 | user_205125746682 | 2022-05-01 00:00:00 | 2                       | 8                        | 34                       | 2022-05-01 00:00:00  |
| 1 | user_222506789984 | 2022-05-01 00:00:00 | 1                       | 42                       | 141                      | 2022-05-01 00:00:00  |
| 2 | user_268514844966 | 2022-05-01 00:00:00 | 1                       | 29                       | 66                       | 2022-05-01 00:00:00  |
| 3 | user_394495759023 | 2022-05-01 00:00:00 | 1                       | 21                       | 68                       | 2022-05-01 00:00:00  |
| 4 | user_459842889956 | 2022-05-01 00:00:00 | 1                       | 14                       | 39                       | 2022-05-01 00:00:00  |
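As a sketch, the same range read can also be computed against the raw data source by setting `from_source=True` (assuming the source is reachable from your notebook environment; output is not reproduced here):

```python
# Bypass the offline store and recompute the same range from the raw source.
result_dataframe = fv.get_historical_features(
    start_time=datetime(2022, 5, 1),
    end_time=datetime(2022, 5, 2),
    from_source=True,
).to_pandas()
```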
## Read the Latest Features from the Online Feature Store

For performance reasons, this function should only be used for testing and not in a production environment. To read features online efficiently, see Reading Features for Inference.
```python
fv.get_online_features({"user_id": "user_609904782486"}).to_dict()
```

Out:

```
{
    "transaction_count_1d_1d": 1,
    "transaction_count_30d_1d": 17,
    "transaction_count_90d_1d": 56,
}
```
## Read Historical Features from the Offline Feature Store with Time-Travel

Create a spine DataFrame with the events to look up. For more information on spines, see Selecting Sample Keys and Timestamps.
```python
spine_df = pandas.DataFrame(
    {
        "user_id": ["user_722584453020", "user_461615966685"],
        "timestamp": [datetime(2022, 5, 1, 3, 20, 0), datetime(2022, 6, 6, 2, 30, 0)],
    }
)
display(spine_df)
```
|   | user_id           | timestamp           |
|---|-------------------|---------------------|
| 0 | user_722584453020 | 2022-05-01 03:20:00 |
| 1 | user_461615966685 | 2022-06-06 02:30:00 |
`from_source=True` can be passed in to bypass the offline store and compute features on-the-fly against the raw data source. However, this will be slower than reading feature data that has been materialized to the offline store.
```python
result_dataframe = fv.get_historical_features(spine_df, from_source=True).to_pandas()
display(result_dataframe)
```
|   | user_id           | timestamp           | user_transaction_counts__transaction_count_1d_1d | user_transaction_counts__transaction_count_30d_1d | user_transaction_counts__transaction_count_90d_1d |
|---|-------------------|---------------------|---------------------------------------------------|----------------------------------------------------|----------------------------------------------------|
| 0 | user_461615966685 | 2022-06-06 02:30:00 | 0                                                 | 13                                                 | 40                                                 |
| 1 | user_722584453020 | 2022-05-01 03:20:00 | 0                                                 | 28                                                 | 73                                                 |