Version: 0.6

Reading Feature Data for Training

Reading feature data for training is the first step in training a model.

The <feature service>.get_historical_features() function reads training data for the features in a Feature Service.

Before writing code to call the function, let's get a conceptual understanding of how the function works.

`<feature service>.get_historical_features()` concepts

The output of `get_historical_features()`

<feature service>.get_historical_features() returns a DataFrame containing:

- Columns for all features in <feature service>, in the format <feature view name>__<feature name>.
- For features that use built-in aggregations, columns in the format <feature view name>__<column name in the Aggregation object>_<function name in the Aggregation object>_<time_window value in the Aggregation object>_<aggregation_interval value>.
- Any additional columns requested by the caller of get_historical_features(). This is explained in the next section.

For example, fraud_detection_feature_service.get_historical_features() returns a DataFrame containing feature columns in the following format. (Other returned columns are not shown.)

Feature View Name	Columns returned
`user_credit_card_issuer`	`user_credit_card_issuer__credit_card_issuer`
`user_transaction_counts`	`user_transaction_counts__transaction_count_1d_1d`, `user_transaction_counts__transaction_count_30d_1d`, `user_transaction_counts__transaction_count_90d_1d`
`user_home_location`	`user_home_location__lat`, `user_home_location__long`
`transaction_amount_is_high`	`transaction_amount_is_high__transaction_amount_is_high`
`transaction_distance_from_home`	`transaction_distance_from_home__dist_km`
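
The naming formats described above can be sketched with two small helpers. This is only an illustration of the string formats: the helper names are hypothetical and are not part of the Tecton SDK, and the aggregation column name `transaction` is an assumption inferred from the table above.

```python
def feature_column_name(feature_view: str, feature: str) -> str:
    """Build <feature view name>__<feature name>."""
    return f"{feature_view}__{feature}"


def aggregation_column_name(
    feature_view: str,
    column: str,
    function: str,
    time_window: str,
    aggregation_interval: str,
) -> str:
    """Build the column name for a built-in aggregation feature."""
    return f"{feature_view}__{column}_{function}_{time_window}_{aggregation_interval}"


print(feature_column_name("user_home_location", "lat"))
# user_home_location__lat

# "transaction" is an assumed column name, chosen to match the table above.
print(aggregation_column_name("user_transaction_counts", "transaction", "count", "30d", "1d"))
# user_transaction_counts__transaction_count_30d_1d
```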

The spine

get_historical_features() takes a spine as input. A spine is a DataFrame (consisting of rows and columns) that identifies the feature data to be read from the offline store.

get_historical_features() joins the spine to the Feature Views in the <feature service>.

Each row of a spine is known as a training event. Internally, Tecton will convert the spine into a relational database table when get_historical_features() is called.

A spine is created by the user and requires:

The columns that are needed to join the spine with the Feature Views in the Feature Service. These columns are:
- The entity columns used in the Feature Views that the Feature Service contains. For fraud_detection_feature_service:

  Feature View	Entities Used (Specified in `entities`)
  `user_credit_card_issuer`	`user_id`
  `user_transaction_counts`	`user_id`
  `user_home_location`	`user_id`

  transaction_amount_is_high and transaction_distance_from_home do not specify `entities`, because entities are not used in On-Demand Feature Views.
- A timestamp key (always required, regardless of the <feature service> that calls get_historical_features()). For fraud_detection_feature_service, this key is timestamp, because the spine is built from the transactions data source, which has a timestamp column.

Columns for the inputs of each On-Demand Feature View, if any, in the Feature Service. For fraud_detection_feature_service, these inputs are all found in the transactions data source, which the spine is built from.

Feature View	Inputs	Required Columns in Spine
`transaction_amount_is_high`	`amt`	`amt`
`transaction_distance_from_home`	`user_home_location` (a Feature View), `merch_lat`, `merch_long`	`merch_lat`, `merch_long`. Note: Columns from `user_home_location` are not required in the spine, because the `fraud_detection_feature_service` service already uses this feature view.

Any additional columns you want to include in the spine and in the output of get_historical_features(). Here, the is_fraud column is included because it is the label (the value being predicted).
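
The spine requirements listed above can be checked before calling get_historical_features(). The following is a minimal sketch that works on plain lists of column names. `missing_spine_columns` is a hypothetical helper, not a Tecton function, and it only verifies that the column names are present.

```python
def missing_spine_columns(spine_columns, entity_columns, timestamp_key, on_demand_inputs):
    """Return the required columns that are absent from the spine, sorted by name."""
    required = set(entity_columns) | {timestamp_key} | set(on_demand_inputs)
    return sorted(required - set(spine_columns))


# Columns of the spine used for fraud_detection_feature_service.
# is_fraud is an additional (label) column, so it is not part of the check.
spine_columns = ["user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud"]

missing = missing_spine_columns(
    spine_columns,
    entity_columns=["user_id"],
    timestamp_key="timestamp",
    on_demand_inputs=["amt", "merch_lat", "merch_long"],
)
print(missing)  # []
```

An empty list means every join column and On-Demand Feature View input is available in the spine.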

Joining the spine to the Feature Views

The following diagram shows how fraud_detection_feature_service.get_historical_features() joins the spine with the Feature Views in the Feature Service. In the diagram, the user_id and timestamp values are included for illustration purposes and are not related to the data used elsewhere in this tutorial.

The On-Demand Feature Views (transaction_amount_is_high and transaction_distance_from_home) are not included in the diagram, because they are not joined to the spine.

To save space in the diagram, Feature View names are not included in the feature name columns of the get_historical_features() output. For example, user_home_location__lat is shown as __lat.

[Diagram: get_historical_features() joining the spine to the Feature Views]

note

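When the spine is joined to the Feature Views, an AS OF join (also known as a point-in-time join) is used. For each spine row, the join returns the most recent feature values known as of that row's timestamp, so that no data from after the training event leaks into the training row. For more information, see this section.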
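
To make the idea of a point-in-time join concrete, here is a minimal pure-Python sketch of the lookup for a single user and a single feature. It is not how Tecton implements the join. The feature values are made up for illustration, and treating a feature row whose timestamp equals the event time as visible is an assumption of this sketch.

```python
from bisect import bisect_right
from datetime import datetime


def as_of_lookup(feature_rows, event_time):
    """Return the value of the latest feature row at or before event_time, else None.

    feature_rows must be a list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_time)
    return feature_rows[idx - 1][1] if idx > 0 else None


# Made-up feature values for one user, keyed by the time each value became available.
transaction_count_30d = [
    (datetime(2023, 6, 18), 8),
    (datetime(2023, 6, 19), 9),
    (datetime(2023, 6, 21), 11),
]

# A training event on 2023-06-20 sees the value from 2023-06-19, not the later one.
event_time = datetime(2023, 6, 20, 14, 50, 13)
print(as_of_lookup(transaction_count_30d, event_time))  # 9
```

An ordinary equality join on timestamp would find no match here, and a join that ignored time could pick up the value from 2023-06-21, which was not yet known when the event occurred.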

Generating the output of the On-Demand Feature Views

After fraud_detection_feature_service.get_historical_features() joins the spine to the Feature Views, it creates a result set containing the joined data. This result set is incomplete until the output of the On-Demand Feature Views is added, as follows.

The transaction_amount_is_high Feature View is run on each row of the result set, with amt used as input. The transaction_amount_is_high__transaction_amount_is_high column is then added to the result set.

The transaction_distance_from_home Feature View is run on each row of the result set, with user_home_location__lat, user_home_location__long, merch_lat and merch_long used as inputs. The transaction_distance_from_home__dist_km column is then added to the result set.
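
The row-by-row step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the tutorial's Feature View code: the threshold of 100 is hypothetical, the haversine formula is assumed for the distance calculation, and the input row is made up. The output column names follow the output tables in this tutorial.

```python
from math import asin, cos, radians, sin, sqrt

HIGH_AMOUNT_THRESHOLD = 100.0  # hypothetical threshold, not taken from the tutorial
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, long) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def add_on_demand_features(row):
    """Return a copy of a joined row with the two On-Demand feature columns added."""
    out = dict(row)
    out["transaction_amount_is_high__transaction_amount_is_high"] = (
        row["amt"] > HIGH_AMOUNT_THRESHOLD
    )
    out["transaction_distance_from_home__dist_km"] = haversine_km(
        row["user_home_location__lat"],
        row["user_home_location__long"],
        row["merch_lat"],
        row["merch_long"],
    )
    return out


# A made-up row, as it might look after the spine is joined to user_home_location.
row = {
    "amt": 25.0,
    "merch_lat": 0.0,
    "merch_long": 1.0,
    "user_home_location__lat": 0.0,
    "user_home_location__long": 0.0,
}
result = add_on_demand_features(row)
print(result["transaction_amount_is_high__transaction_amount_is_high"])  # False
print(round(result["transaction_distance_from_home__dist_km"], 1))
```

Note that the distance calculation reads user_home_location__lat and user_home_location__long from the joined row, which is why those columns do not need to be in the spine.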

Calling `fraud_detection_feature_service.get_historical_features()`

Now that you have a conceptual understanding of how <feature service>.get_historical_features() works, you will create a spine and call get_historical_features() with the spine.

Creating the spine

Create the spine by querying the data source, filtering on a time range that is specified in start_time and end_time. Select the columns as discussed in the concepts section above:

from datetime import datetime

# ws is the Tecton workspace object used throughout this tutorial.
training_events = (
    ws.get_data_source("transactions")
    .get_dataframe(
        start_time=datetime(2023, 6, 20, 10, 26, 41),
        end_time=datetime(2023, 6, 20, 15, 56, 0),
    )
    .to_spark()
    .select("user_id", "timestamp", "amt", "merch_lat", "merch_long", "is_fraud")
)

training_events.show()

Sample Output:

user_id	timestamp	amt	merch_lat	merch_long	is_fraud
user_884240387242	2023-06-20 10:26:41	68.23	42.710006	-78.338644	0
user_268514844966	2023-06-20 12:57:20	32.98	39.153572	-122.36427	0
user_722584453020	2023-06-20 14:49:59	4.5	33.033236	-105.7457	0
user_337750317412	2023-06-20 14:50:13	7.68	40.682842	-88.808371	0
user_934384811883	2023-06-20 15:55:09	68.97	39.144282	-96.125035	1

note

You do not need to run <workspace>.get_data_source().get_dataframe() to create the spine, but the spine must meet the requirements explained previously.

Calling `fraud_detection_feature_service.get_historical_features()` with the spine

In your notebook, run the following code:

fraud_detection_feature_service = ws.get_feature_service("fraud_detection_feature_service")

# Use from_source=True because materialization isn't enabled.
training_data = fraud_detection_feature_service.get_historical_features(
    spine=training_events, timestamp_key="timestamp", from_source=True
).to_spark()

training_data.show()

The following is example output from a call to fraud_detection_feature_service.get_historical_features():

user_id	timestamp	amt	merch_lat	merch_long	user_credit_card_issuer__credit_card_issuer	user_transaction_counts__transaction_id_count_1d_1d	user_transaction_counts__transaction_id_count_30d_1d	user_transaction_counts__transaction_id_count_90d_1d	user_home_location__lat	user_home_location__long	transaction_amount_is_high__transaction_amount_is_high	transaction_distance_from_home__dist_km
user_268514844966	2023-06-20 12:57:20	32.98	39.1536	-122.364	other	2	20	51	46.0916	-103.135	False	1746.71
user_337750317412	2023-06-20 14:50:13	7.68	40.6828	-88.8084	Visa	0	10	55	40.6428	-89.5988	False	66.8401
user_722584453020	2023-06-20 14:49:59	4.5	33.0332	-105.746	Discover	2	30	95	32.4259	-106.614	False	105.633
user_884240387242	2023-06-20 10:26:41	68.23	42.71	-78.3386	other	0	27	101	38.9021	-88.6645	False	966.133
