Splitting and Transforming Feature Data
In this topic, you will split the data retrieved earlier (using fraud_detection_feature_service.get_historical_features()) into training and testing data sets, and then transform the data for use by the model.
The code shown on this page is model-related rather than feature-related. Therefore, a notebook, or another location outside of a Tecton feature repository, is the appropriate place to store this code.
In your notebook, run the following code to import the needed modules:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn import metrics
from sklearn.metrics import mean_squared_error
Splitting the feature data
The following code splits the data retrieved earlier (using fraud_detection_feature_service.get_historical_features()) into training and testing data sets, each of which has an "x" and a "y" component:
- X_train: Contains the training data for the features
- y_train: Contains the training data for is_fraud (the value that is being predicted)
- X_test: Contains the testing data for the features
- y_test: Contains the testing data for is_fraud (the value that is being predicted)
Run this code in your notebook:
# Drop the columns that are not used as model features, then convert
# the Spark DataFrame to a pandas DataFrame.
training_data_pd = training_data.drop(
    "user_id",
    "merchant",
    "transaction_id",
    "timestamp",
    "amt",
    "merch_lat",
    "merch_long",
).toPandas()

# Separate the prediction target (is_fraud) from the features.
y = training_data_pd["is_fraud"]
x = training_data_pd.drop("is_fraud", axis=1)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(x, y)
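By default, train_test_split holds out 25% of the rows for testing and shuffles the rows differently on each run. If you want a reproducible split, you can pass test_size and random_state explicitly; the values below are illustrative, not part of the tutorial:

# Illustrative only: fix the test fraction and the random seed so the
# split is the same every time the notebook runs.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)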
Transforming the feature data
In this section, you will apply a few transformations to the feature data that will be used by the model.
Reordering the columns in the training data set
First, you will reorder the columns in the training and testing data sets to match the column order of the inference data set that you will read later.
Feature data that is read for inference is returned with features in the following order (the sketch after this list illustrates these rules):
- For On-Demand Feature Views, feature ordering is the same as the order of the fields in the output_schema that is defined in the Feature View.
- For Batch and Stream Feature Views, feature ordering is alphabetical.
- When feature data is generated by multiple Feature Views, On-Demand Feature Views are ordered first (in alphabetical order), followed by the others (in alphabetical order).
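As a rough illustration of these ordering rules, the following sketch computes the expected inference column order from two groups of feature names. The helper function is hypothetical (Tecton does not expose it), and the grouping below is an assumption based on this tutorial's Feature Views:

# Hypothetical helper that mirrors the documented ordering rules:
# On-Demand Feature View features first (alphabetically), then
# Batch/Stream Feature View features (alphabetically).
def expected_inference_order(odfv_features, other_features):
    return sorted(odfv_features) + sorted(other_features)

# Assumed grouping for this tutorial: the transaction_* features come from
# On-Demand Feature Views; the user_* features come from Batch Feature Views.
odfv = [
    "transaction_amount_is_high__transaction_amount_is_high",
    "transaction_distance_from_home__dist_km",
]
batch = [
    "user_credit_card_issuer__user_credit_card_issuer",
    "user_home_location__lat",
    "user_home_location__long",
    "user_transaction_counts__transaction_id_count_1d_1d",
    "user_transaction_counts__transaction_id_count_30d_1d",
    "user_transaction_counts__transaction_id_count_90d_1d",
]
print(expected_inference_order(odfv, batch))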
Reorder the columns in the training and testing data to match those of the inference data (that you will read later) by running the following code in your notebook:
reorder_columns = [
"transaction_amount_is_high__transaction_amount_is_high",
"transaction_distance_from_home__dist_km",
"user_credit_card_issuer__user_credit_card_issuer",
"user_home_location__lat",
"user_home_location__long",
"user_transaction_counts__transaction_id_count_1d_1d",
"user_transaction_counts__transaction_id_count_30d_1d",
"user_transaction_counts__transaction_id_count_90d_1d",
]
X_train = X_train.reindex(columns=reorder_columns)
X_test = X_test.reindex(columns=reorder_columns)
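As a quick sanity check (not part of the original tutorial), you can confirm that both frames now have their columns in the expected inference order:

# Optional check: the reindexed frames should match reorder_columns exactly.
assert list(X_train.columns) == reorder_columns
assert list(X_test.columns) == reorder_columns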
The remainder of the transformations
The remaining transformations are included in the following code. They operate on numeric and categorical values and are built using scikit-learn pipelines. Run this code in your notebook.
# Get the list of numeric columns and the list of categorical columns.
num_cols = X_train.select_dtypes(exclude=["object"]).columns.tolist()
cat_cols = X_train.select_dtypes(include=["object"]).columns.tolist()

# Create a pipeline num_pipe to transform numeric values. SimpleImputer
# fills in missing values and StandardScaler standardizes the data.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Create a pipeline cat_pipe to transform categorical (string) values.
# SimpleImputer fills in missing values with the constant "N/A" and
# OneHotEncoder one-hot encodes each of the categorical columns.
# Note: in scikit-learn 1.2 and later, the sparse parameter is named
# sparse_output (sparse was removed in 1.4).
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="N/A"),
    OneHotEncoder(handle_unknown="ignore", sparse=False),
)

# Combine the num_pipe and cat_pipe pipelines into one ColumnTransformer
# that applies each pipeline to its matching set of columns.
full_pipe = ColumnTransformer([("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)])
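The code above defines the pipeline but does not yet apply it; that happens during model training. For reference, a minimal sketch of how a ColumnTransformer like this is typically used looks as follows; it is illustrative rather than a step in this tutorial:

# Illustrative usage: fit the transformations on the training features only,
# then apply the same fitted transformations to the test features.
X_train_transformed = full_pipe.fit_transform(X_train)
X_test_transformed = full_pipe.transform(X_test)

Fitting only on the training data and reusing the fitted transformer on the test data keeps statistics such as the median and the scaling parameters from leaking information from the test set.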
On the next page, you will create and train a model using the training data that you transformed here.