Secondary Key Aggregations
By default, Tecton's Aggregation Engine groups the raw data by the join keys of a Feature View's entity (or group of entities).
While this approach works effectively when retrieving features for a known set of keys, it falls short in scenarios where you need to fetch features for an unknown (and potentially unbounded) set of keys. This situation often arises in use cases such as Recommendation Systems.
With Secondary Key Aggregation, you can instruct Tecton to aggregate not only over the join keys of a Feature View's entities, but also over a secondary key. At feature request time, you only need to specify the entity's join keys.
Example
Let's look at a couple of features we may want to create for an ad prediction problem: recommending advertisements to show to a given user. The data source for these features is a historical event log of ad impressions, and we want to develop the following two features:
- For a given UserID, how many times has the user watched each AdId in the last 1 day and 7 days?
- For a given UserID, how many total seconds has the user watched each AdId in the last 1 day and 7 days?
Example mocked data source
from tecton import pandas_batch_config, BatchSource
from datetime import datetime, timedelta


# Mock batch source backed by an in-memory pandas DataFrame of ad impressions.
@pandas_batch_config(timestamp_field="timestamp")
def mock_data(context):
    import pandas as pd

    cols = ["user_id", "ad_id", "timestamp", "seconds_watched"]
    data = [
        ["user_1", "ad_1", "2022-05-14 00:00:00", 1],
        ["user_1", "ad_1", "2022-05-14 00:00:00", 1],
        ["user_1", "ad_1", "2022-05-14 12:00:00", 2],
        ["user_1", "ad_1", "2022-05-14 23:59:59", 3],
        ["user_1", "ad_2", "2022-05-15 00:00:00", 4],
        ["user_1", "ad_3", "2022-05-15 12:00:00", 5],
        ["user_1", "ad_4", "2022-05-15 23:59:59", 6],
        ["user_1", "ad_5", "2022-05-16 00:00:00", 7],
        ["user_1", "ad_5", "2022-05-16 12:00:00", 8],
        ["user_1", "ad_5", "2022-05-16 23:59:59", 9],
        ["user_1", "ad_5", "2022-05-17 00:00:00", 10],
        ["user_1", "ad_6", "2022-05-17 00:00:00", 10],
        ["user_1", "ad_7", "2022-05-17 12:00:00", 11],
        ["user_1", "ad_8", "2022-05-17 23:59:59", 12],
        ["user_1", "ad_9", "2022-05-18 00:00:00", 13],
        ["user_1", "ad_9", "2022-05-18 12:00:00", 14],
        ["user_1", "ad_9", "2022-05-18 23:59:59", 15],
        ["user_1", "ad_10", "2022-05-19 00:00:00", 16],
        ["user_1", "ad_11", "2022-05-19 12:00:00", 17],
        ["user_1", "ad_12", "2022-05-19 23:59:59", 18],
        ["user_2", "ad_13", "2022-05-19 23:59:59", 20],
    ]
    df = pd.DataFrame(data, columns=cols)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # The config function must return the DataFrame of raw events.
    return df


ds = BatchSource(name="mock_data", batch_config=mock_data)
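To sanity-check the source before building features on it, you can read it back as a pandas DataFrame. A quick sketch, assuming an interactive notebook connected to a Tecton workspace where the SDK's get_dataframe() helper is available:

# Preview the mock source's events as a pandas DataFrame.
preview = ds.get_dataframe().to_pandas()
print(preview.head())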
Example Feature View
from datetime import datetime, timedelta

from tecton import Entity, batch_feature_view, Aggregation
from tecton.types import Field, String, Timestamp, Int64

user_entity = Entity(name="user", join_keys=["user_id"])


# Leverage Tecton's Secondary Key Aggregations to get per-ad metrics
@batch_feature_view(
    mode="pandas",
    sources=[ds],
    entities=[user_entity],
    aggregation_secondary_key="ad_id",
    aggregation_interval=timedelta(days=1),
    timestamp_field="timestamp",
    offline=True,
    online=True,
    feature_start_time=datetime(2022, 5, 1),
    aggregations=[
        Aggregation(
            column="impression", function="count", time_window=timedelta(days=1), name="impression_count_per_ad_1d"
        ),
        Aggregation(
            column="seconds_watched",
            function="sum",
            time_window=timedelta(days=1),
            name="sum_seconds_watched_per_ad_1d",
        ),
        Aggregation(
            column="impression", function="count", time_window=timedelta(days=7), name="impression_count_per_ad_7d"
        ),
        Aggregation(
            column="seconds_watched",
            function="sum",
            time_window=timedelta(days=7),
            name="sum_seconds_watched_per_ad_7d",
        ),
    ],
    schema=[
        Field("user_id", String),
        Field("ad_id", String),
        Field("timestamp", Timestamp),
        Field("seconds_watched", Int64),
        Field("impression", Int64),
    ],
)
def user_ad_watched_features(input_table):
    # Each row is one impression; add a literal 1 so "count" can aggregate it.
    input_table["impression"] = 1
    return input_table[["user_id", "ad_id", "timestamp", "seconds_watched", "impression"]]
Pay attention to the aggregation_secondary_key parameter. It instructs Tecton to group the raw data not only by user_id, but also by ad_id.
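Conceptually, for a single 1-day window this is similar to the pandas group-by below. This is only an illustrative sketch of the grouping semantics (events is a small stand-in for the mock data above), not how Tecton's Aggregation Engine executes it:

import pandas as pd

# A few events shaped like the mock source above.
events = pd.DataFrame(
    {
        "user_id": ["user_1", "user_1", "user_1"],
        "ad_id": ["ad_1", "ad_1", "ad_2"],
        "timestamp": pd.to_datetime(["2022-05-14 00:00:00", "2022-05-14 12:00:00", "2022-05-15 00:00:00"]),
        "seconds_watched": [1, 2, 4],
    }
)

# Group by the entity key AND the secondary key within each 1-day bucket.
daily = (
    events.assign(impression=1)
    .groupby(["user_id", "ad_id", pd.Grouper(key="timestamp", freq="1D")])
    .agg(
        impression_count_per_ad_1d=("impression", "count"),
        sum_seconds_watched_per_ad_1d=("seconds_watched", "sum"),
    )
    .reset_index()
)
print(daily)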
You may wonder why you wouldn't just specify two entities on the Feature View:
user_entity = Entity(name="user", join_keys=["user_id"])
ad_entity = Entity(name="ad", join_keys=["ad_id"])
entities=[user_entity, ad_entity]
The difference lies in how you want to retrieve the feature data at request time. If you want to retrieve the aggregations for a known (user, ad) pair, you don't need Secondary Key Aggregations.
If you want to retrieve the aggregations for all ads a given user has interacted with, you do.
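The difference shows up in the spine you pass at retrieval time. A sketch (both spines are hypothetical):

import pandas as pd
from datetime import datetime

# With two entities, every spine row must name a specific ad.
two_entity_spine = pd.DataFrame(
    {
        "user_id": ["user_1"],
        "ad_id": ["ad_1"],
        "timestamp": [datetime(2022, 5, 19)],
    }
)

# With a secondary key aggregation, the spine only needs the entity key;
# Tecton returns per-ad arrays for every ad the user has interacted with.
secondary_key_spine = pd.DataFrame(
    {
        "user_id": ["user_1"],
        "timestamp": [datetime(2022, 5, 19)],
    }
)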
Example Output
At request time, you can now query feature values by user_id alone, without having to specify an ad_id. Tecton will return the aggregations for every ad_id that the specified user has interacted with in the specified time window.
import pandas as pd

training_events = pd.DataFrame(
    {
        "user_id": ["user_1", "user_1", "user_2"],
        "timestamp": [datetime(2022, 5, 19), datetime(2022, 5, 15), datetime(2022, 5, 20)],
    }
)

df = user_ad_watched_features.get_features_for_events(training_events).to_pandas()
display(df)
The output includes a "keys" column for each aggregation window length, containing the list of all secondary keys found in that window. The corresponding aggregate feature values for those keys can be found in the remaining columns. Together, these columns form a map of keys and values.
If needed, these columns can easily be zipped into a map.
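For example, here is a minimal sketch that zips the 7-day keys column with the 7-day watch-time column into a per-row {ad_id: seconds_watched} dict, using the column names shown in the output below:

# Combine each row's keys array with its values array into a dict.
keys_col = "user_ad_watched_features__ad_id_keys_7d"
vals_col = "user_ad_watched_features__sum_seconds_watched_per_ad_7d"
df["seconds_watched_per_ad_7d"] = [
    dict(zip(keys, vals)) for keys, vals in zip(df[keys_col], df[vals_col])
]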
|   | user_id | timestamp | user_ad_watched_features__ad_id_keys_1d | user_ad_watched_features__ad_id_keys_7d | user_ad_watched_features__impression_count_per_ad_1d | user_ad_watched_features__sum_seconds_watched_per_ad_1d | user_ad_watched_features__impression_count_per_ad_7d | user_ad_watched_features__sum_seconds_watched_per_ad_7d |
|---|---------|-----------|---|---|---|---|---|---|
| 0 | user_1 | 2022-05-15 00:00:00 | ['ad_1'] | ['ad_1'] | 4 | 7 | [4] | [7] |
| 1 | user_1 | 2022-05-19 00:00:00 | ['ad_9'] | ['ad_1' 'ad_2' 'ad_3' 'ad_4' 'ad_5' 'ad_6' 'ad_7' 'ad_8' 'ad_9'] | 3 | 42 | [4 1 1 1 4 1 1 1 3] | [ 7 4 5 6 34 10 11 12 42] |
| 2 | user_2 | 2022-05-20 00:00:00 | ['ad_13'] | ['ad_13'] | 1 | 20 | [1] | [20] |
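Because the Feature View is materialized online (online=True), the same per-ad aggregations can also be fetched at serving time. A minimal sketch, assuming get_online_features is available on the Feature View in your SDK version:

# Fetch the per-ad aggregations for a single user from the online store.
features = user_ad_watched_features.get_online_features(join_keys={"user_id": "user_1"})
print(features.to_dict())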