Reading Feature Data for Training
Reading feature data for training is the first step in training a model.
The <feature service>.get_historical_features() function reads training data for the features in a Feature Service.
Before writing code to call the function, let's get a conceptual understanding of how the function works.
<feature service>.get_historical_features() concepts
The output of get_historical_features()
<feature service>.get_historical_features() returns a DataFrame containing:
- Columns for all features in <feature service>, in the format <feature view name>__<feature name>.
- For built-in aggregations, columns in the format <feature view name>__<column name in the Aggregation object>_<function name in the Aggregation object>_<time_window value in Aggregation object>_<aggregation_interval value>.
- Additional columns, as requested by the caller of get_historical_features(). This is explained in the next section.
For example, fraud_detection_feature_service.get_historical_features() returns a DataFrame containing columns for the features in the following format. (Other columns that are returned are not shown.)
Feature View Name | Columns returned |
---|---|
user_credit_card_issuer | user_credit_card_issuer__credit_card_issuer |
user_transaction_counts | user_transaction_counts__transaction_count_1d_1d , user_transaction_counts__transaction_count_30d_1d , user_transaction_counts__transaction_count_90d_1d |
user_home_location | user_home_location__lat , user_home_location__long |
transaction_amount_is_high | transaction_amount_is_high__transaction_amount_is_high |
transaction_distance_from_home | transaction_distance_from_home__dist_km |
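
The naming rules above can be sketched as two small string-building helpers. This is only an illustration: feature_column() and aggregation_column() are hypothetical helpers, not part of Tecton, and the aggregation inputs used here (a column named transaction, the count function, and a 1d aggregation interval) are assumptions chosen because they would produce the names shown in the table.

```python
def feature_column(feature_view: str, feature: str) -> str:
    """Build an output column name: <feature view name>__<feature name>."""
    return f"{feature_view}__{feature}"


def aggregation_column(
    feature_view: str,
    column: str,
    function: str,
    time_window: str,
    aggregation_interval: str,
) -> str:
    """Build the output column name for a built-in aggregation."""
    feature = f"{column}_{function}_{time_window}_{aggregation_interval}"
    return feature_column(feature_view, feature)


# Non-aggregate feature.
print(feature_column("user_home_location", "lat"))

# Aggregate features; the column and function names are assumptions.
for window in ("1d", "30d", "90d"):
    print(
        aggregation_column(
            "user_transaction_counts", "transaction", "count", window, "1d"
        )
    )
```

The double underscore separates the Feature View name from the feature name, so the Feature View a column came from can be recovered by splitting on the first "__".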
The spine​
get_historical_features() takes a spine as input. A spine is a DataFrame (consisting of rows and columns) that identifies the feature data to be read from the offline store. get_historical_features() joins the spine to the Feature Views in the <feature service>.
Each row of a spine is known as a training event. Internally, Tecton will convert the spine into a relational database table when get_historical_features() is called.
A spine is created by the user and requires:
- The columns that are needed to join the spine with the Feature Views in the Feature Service. These columns are:
  - The entity columns used in the Feature Views that the Feature Service contains. For fraud_detection_feature_service:

    Feature View | Entities Used (Specified in entities) |
    ---|---|
    user_credit_card_issuer | user_id |
    user_transaction_counts | user_id |
    user_home_location | user_id |

    transaction_amount_is_high and transaction_distance_from_home do not specify the entities column, because entities are not used in On-Demand Feature Views.
  - A timestamp key (always required, regardless of the <feature service> that calls get_historical_features()). For fraud_detection_feature_service, this key is timestamp, because the spine is built from the transactions data source, which has a timestamp column.
- Columns for the input(s) for each On-Demand Feature View, if any, in the Feature Service. For fraud_detection_feature_service, these inputs are all found in the transactions data source that the spine is built against.

    Feature View | Inputs | Required Columns in Spine |
    ---|---|---|
    transaction_amount_is_high | amt | amt |
    transaction_distance_from_home | user_home_location (a Feature View), merch_lat, merch_long | merch_lat, merch_long. Note: Columns from user_home_location are not required in the spine, because the fraud_detection_feature_service service already uses this Feature View. |
- Any additional columns you want to include in the spine output and output of get_historical_features(). Here, the is_fraud column is included, because this is the label (the value that is being predicted).
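
The requirements above can be turned into a quick pre-flight check on a candidate spine's column names. The sketch below is plain Python and does not call Tecton; required_spine_columns() and missing_spine_columns() are hypothetical helper names, and the column lists are the ones described for fraud_detection_feature_service.

```python
def required_spine_columns(entity_columns, timestamp_key, on_demand_inputs):
    """Return the columns a spine needs for the join and for On-Demand inputs."""
    return set(entity_columns) | {timestamp_key} | set(on_demand_inputs)


def missing_spine_columns(spine_columns, entity_columns, timestamp_key, on_demand_inputs):
    """Return the required columns that are absent from the spine, sorted by name."""
    required = required_spine_columns(entity_columns, timestamp_key, on_demand_inputs)
    return sorted(required - set(spine_columns))


# Column names of the spine used in this tutorial; is_fraud is the optional label.
spine_columns = ["user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud"]

missing = missing_spine_columns(
    spine_columns,
    entity_columns=["user_id"],
    timestamp_key="timestamp",
    on_demand_inputs=["amt", "merch_lat", "merch_long"],
)
print(missing)  # an empty list means every required column is present
```

Extra columns such as is_fraud do not affect the check, which mirrors the rule that additional columns are allowed in the spine.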
Joining the spine to the Feature Views​
The following diagram shows how fraud_detection_feature_service.get_historical_features() joins the spine with the Feature Views in the Feature Service. In the diagram, the user_id and timestamp values are included for illustration purposes and are not related to the data used elsewhere in this tutorial.
The On-Demand Feature Views (transaction_amount_is_high and transaction_distance_from_home) are not included in the diagram, because they are not joined to the spine.
To save space in the diagram, Feature View names are not included in the feature name columns of the get_historical_features() output. For example, user_home_location__lat is shown as __lat.
When the spine is joined to the Feature Views, an AS OF join (also known as a point-in-time join) is used. For more information, see this section.
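
To make the idea of an AS OF join concrete, here is a minimal pure-Python sketch: for each training event, it picks the most recent feature value recorded at or before the event's timestamp. The function names and the sample data are hypothetical, and treating a feature row with exactly the same timestamp as visible to the event is an assumption of this sketch; Tecton's exact boundary behavior may differ.

```python
from bisect import bisect_right
from datetime import datetime


def as_of_lookup(feature_rows, event_time):
    """Return the latest value whose timestamp is <= event_time, else None.

    feature_rows must be a list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_time)
    return feature_rows[idx - 1][1] if idx > 0 else None


def point_in_time_join(spine, features_by_user):
    """Attach to each spine row the feature value known as of its timestamp."""
    joined = []
    for row in spine:
        history = features_by_user.get(row["user_id"], [])
        joined.append({**row, "feature": as_of_lookup(history, row["timestamp"])})
    return joined


# Hypothetical feature history for one user: (timestamp, value).
features_by_user = {
    "user_1": [
        (datetime(2023, 6, 1), 10),
        (datetime(2023, 6, 10), 20),
        (datetime(2023, 6, 20), 30),
    ]
}

# Hypothetical training events.
spine = [
    {"user_id": "user_1", "timestamp": datetime(2023, 5, 31)},  # before any value
    {"user_id": "user_1", "timestamp": datetime(2023, 6, 15)},  # between values
    {"user_id": "user_2", "timestamp": datetime(2023, 6, 15)},  # unknown user
]

result = point_in_time_join(spine, features_by_user)
for row in result:
    print(row["user_id"], row["timestamp"].date(), row["feature"])
```

Because a training event never sees feature values recorded after its own timestamp, this kind of join avoids leaking future information into the training data.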
Generating the output of the On-Demand Feature Views​
After fraud_detection_feature_service.get_historical_features() joins the spine to the Feature Views, it creates a result set containing that data. This result set is incomplete because the output of the On-Demand Feature Views still needs to be added, as follows.
The transaction_amount_is_high Feature View is run on each row of the result set, with amt used as input. The transaction_amount_is_high__transaction_amount_is_high column is then added to the result set.
The transaction_distance_from_home Feature View is run on each row of the result set, with user_home_location__lat, user_home_location__long, merch_lat, and merch_long used as inputs. The transaction_distance_from_home__dist_km column is then added to the result set.
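
The row-by-row step can be illustrated with a plain-Python sketch. The actual Feature View definitions are not shown on this page, so both computations below are assumptions: the 100 threshold for a "high" amount is hypothetical, and the distance is computed with the standard haversine formula, which may not be what transaction_distance_from_home uses. The output column names follow the example output table in this tutorial.

```python
import math

AMOUNT_THRESHOLD = 100  # hypothetical cutoff; the real Feature View may differ


def transaction_amount_is_high(amt):
    """Flag a transaction whose amount exceeds the assumed threshold."""
    return amt > AMOUNT_THRESHOLD


def dist_km(home_lat, home_long, merch_lat, merch_long):
    """Great-circle (haversine) distance in kilometers between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(home_lat), math.radians(merch_lat)
    dphi = phi2 - phi1
    dlambda = math.radians(merch_long - home_long)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def add_on_demand_features(row):
    """Return a copy of a joined row with the On-Demand feature columns added."""
    out = dict(row)
    out["transaction_amount_is_high__transaction_amount_is_high"] = (
        transaction_amount_is_high(row["amt"])
    )
    out["transaction_distance_from_home__dist_km"] = dist_km(
        row["user_home_location__lat"],
        row["user_home_location__long"],
        row["merch_lat"],
        row["merch_long"],
    )
    return out


# One joined row, using values from the example output in this tutorial.
row = {
    "amt": 7.68,
    "merch_lat": 40.6828,
    "merch_long": -88.8084,
    "user_home_location__lat": 40.6428,
    "user_home_location__long": -89.5988,
}
enriched = add_on_demand_features(row)
print(enriched["transaction_amount_is_high__transaction_amount_is_high"])
print(round(enriched["transaction_distance_from_home__dist_km"], 2))
```

Note that the home location inputs come from the joined result set, not from the spine, which is why the spine only needs amt, merch_lat, and merch_long.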
Calling fraud_detection_feature_service.get_historical_features()
Now that you have a conceptual understanding of how <feature service>.get_historical_features() works, you will create a spine and call get_historical_features() with the spine.
Creating the spine​
Create the spine by querying the data source, filtering on a time range that is specified in start_time and end_time. Select the columns as discussed in the concepts section above:
from datetime import datetime

# ws is assumed to be the Tecton workspace object created earlier in the tutorial.
training_events = (
    ws.get_data_source("transactions")
    .get_dataframe(
        start_time=datetime(2023, 6, 20, 10, 26, 41),
        end_time=datetime(2023, 6, 20, 15, 56, 0),
    )
    .to_spark()
    .select("user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud")
)
training_events.show()
Sample Output:
user_id | timestamp | amt | merch_lat | merch_long | is_fraud |
---|---|---|---|---|---|
user_884240387242 | 2023-06-20 10:26:41 | 68.23 | 42.710006 | -78.338644 | 0 |
user_268514844966 | 2023-06-20 12:57:20 | 32.98 | 39.153572 | -122.36427 | 0 |
user_722584453020 | 2023-06-20 14:49:59 | 4.5 | 33.033236 | -105.7457 | 0 |
user_337750317412 | 2023-06-20 14:50:13 | 7.68 | 40.682842 | -88.808371 | 0 |
user_934384811883 | 2023-06-20 15:55:09 | 68.97 | 39.144282 | -96.125035 | 1 |
You do not need to run <workspace>.get_data_source().get_dataframe() to create the spine, but the spine must meet the requirements explained previously.
Calling fraud_detection_feature_service.get_historical_features() with the spine
In your notebook, run the following code:
fraud_detection_feature_service = ws.get_feature_service("fraud_detection_feature_service")

# Use from_source=True because materialization isn't enabled.
training_data = fraud_detection_feature_service.get_historical_features(
    spine=training_events, timestamp_key="timestamp", from_source=True
).to_spark()
training_data.show()
The following is example output from a call to fraud_detection_feature_service.get_historical_features():
user_id | timestamp | amt | merch_lat | merch_long | is_fraud | user_credit_card_issuer__credit_card_issuer | user_transaction_counts__transaction_id_count_1d_1d | user_transaction_counts__transaction_id_count_30d_1d | user_transaction_counts__transaction_id_count_90d_1d | user_home_location__lat | user_home_location__long | transaction_amount_is_high__transaction_amount_is_high | transaction_distance_from_home__dist_km |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_268514844966 | 2023-06-20 12:57:20 | 32.98 | 39.1536 | -122.364 | 0 | other | 2 | 20 | 51 | 46.0916 | -103.135 | False | 1746.71 |
user_337750317412 | 2023-06-20 14:50:13 | 7.68 | 40.6828 | -88.8084 | 0 | Visa | 0 | 10 | 55 | 40.6428 | -89.5988 | False | 66.8401 |
user_722584453020 | 2023-06-20 14:49:59 | 4.5 | 33.0332 | -105.746 | 0 | Discover | 2 | 30 | 95 | 32.4259 | -106.614 | False | 105.633 |
user_884240387242 | 2023-06-20 10:26:41 | 68.23 | 42.71 | -78.3386 | 0 | other | 0 | 27 | 101 | 38.9021 | -88.6645 | False | 966.133 |