This feature is currently in Private Preview.
Connect to BigQuery
Tecton is a BigQuery Validated Partner and integrates with BigQuery to run automated feature engineering pipelines on BigQuery data. To run feature pipelines based on data in BigQuery, Tecton needs to be configured with access to your BigQuery data warehouse. The following guide shows how to configure these permissions and validate that Tecton can connect to your data source.
Configuring BigQuery Authentication
To enable materialization jobs to authenticate to BigQuery, add your service account credentials as secrets in Tecton Secrets and reference them in your BigQuery configuration block as shown below. If you use a different type of authentication, you can instead use a custom pandas_batch_config and retrieve and inject secrets into a block of code you define there to connect to your BigQuery instance.
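A minimal sketch of that custom route, assuming the service account key is stored as a JSON string in Tecton Secrets and that pandas_batch_config accepts a secrets mapping and passes the resolved values into your function (the scope, key, and function names here are placeholders):

import json

from google.cloud import bigquery
from google.oauth2 import service_account
from tecton import pandas_batch_config, Secret


# Placeholder scope/key names; the secret is assumed to hold the service
# account key file contents as a JSON string.
@pandas_batch_config(
    secrets={"bq_key_json": Secret(scope="your-secrets-scope", key="your-bq-service-account-key")},
)
def sf_bike_trips_source(secrets):
    # Build a BigQuery client from the injected service account key.
    creds = service_account.Credentials.from_service_account_info(json.loads(secrets["bq_key_json"]))
    client = bigquery.Client(credentials=creds, project=creds.project_id)
    query = """
        SELECT trip_id, duration_sec, start_station_name, start_date
        FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
    """
    # Return the query result as a pandas DataFrame for Tecton to ingest.
    return client.query(query).to_dataframe()

The decorated object can then typically be passed as the batch_config of a BatchSource in place of a BigQueryConfig.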
Ways to Authenticate:
The following permissions are required for the role used to access the BigQuery data source:
- bigquery.jobs.create
- bigquery.tables.get
- bigquery.tables.create
- Service Account Credentials
  - Use Case: Can be used in development and production setups.
  - How to Use:
    - Add the service account credentials as secrets in Tecton Secrets and reference them in your BigQueryConfig block as shown in the example below.
    - Optionally, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the credentials file (see the sketch after this list).
- IAM Roles
  - Use Case: Ideal for applications running on Google Cloud services.
  - How to Use:
    - Configure your cloud service with an IAM role that includes the necessary BigQuery permissions.
- gcloud auth Authentication
  - Use Case: Best suited for developers working in local or notebook environments.
  - How to Use:
    - Use gcloud auth login for personal accounts, or gcloud auth activate-service-account with a key file for service accounts.
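For example, when using service account credentials in a local or notebook environment, you could point the Google client libraries at a downloaded key file before defining your Tecton objects (the path below is a placeholder):

import os

# Placeholder path; point this at your downloaded service account key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"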
Testing a BigQuery Data Source
To validate that Tecton can read your BigQuery data source, create a Tecton Data Source definition and test that you can read data from the Data Source. The following example shows how to define a BigQueryConfig in your notebook environment using service account credentials, and validate that Tecton is able to read from your BigQuery data source.
When developing in a local or notebook environment, you may additionally need to install google-cloud-bigquery[pandas].
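For example, you can typically install it with pip:

pip install "google-cloud-bigquery[pandas]"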
BigQuery as a data source:
import tecton
from datetime import datetime, timedelta
from tecton import BigQueryConfig, BatchSource, Secret

# Follow the prompt to complete your Tecton Account sign in
tecton.login("https://<your-account>.tecton.ai")

# Declare a BigQueryConfig instance that can be used as an argument to BatchSource
bq_config = BigQueryConfig(
    project_id="bigquery-public-data",
    dataset="san_francisco_bikeshare",
    location="US",
    # table="bikeshare_trips",
    query="""
        SELECT trip_id, duration_sec, start_station_name, start_date
        FROM bigquery-public-data.san_francisco_bikeshare.bikeshare_trips""",
    timestamp_field="start_date",
    credentials=Secret(scope="your-secrets-scope", key="your-bq-service-account-key"),
)

# Use the config in a BatchSource
ds = BatchSource(name="sf_bike_trips_ds", batch_config=bq_config)
ds.validate()

tecton_df = ds.get_dataframe(start_time=datetime(2018, 1, 1), end_time=datetime(2018, 1, 1) + timedelta(days=1))
tecton_df.to_pandas().head(10)
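The commented-out table argument shows the alternative to a custom query: point the config directly at a table and let Tecton read it as-is. A minimal sketch using the same public dataset (the secret scope and key are placeholders):

bq_table_config = BigQueryConfig(
    project_id="bigquery-public-data",
    dataset="san_francisco_bikeshare",
    table="bikeshare_trips",
    location="US",
    timestamp_field="start_date",
    credentials=Secret(scope="your-secrets-scope", key="your-bq-service-account-key"),
)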
When using mode='bigquery_sql' (batch Feature Views only), Tecton runs the feature pipeline on Rift and pushes the SQL transformations down into BigQuery.
from datetime import timedelta, datetime
from tecton.types import Field, Int64, String, Timestamp
from tecton import batch_feature_view, BigQueryConfig, Aggregation, Entity

station = Entity(name="station", join_keys=["start_station_name"])


@batch_feature_view(
    mode="bigquery_sql",
    sources=[ds],
    entities=[station],
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(function="mean", column="duration_sec", time_window=timedelta(days=1)),
    ],
    schema=[Field("start_station_name", String), Field("duration_sec", Int64), Field("timestamp", Timestamp)],
)
def trip_duration_by_station(sf_bike_trips):
    return f"""
        SELECT
            start_station_name,
            duration_sec,
            start_date as timestamp
        FROM {sf_bike_trips}
        """


trip_duration_by_station.validate()

tecton_df = trip_duration_by_station.get_features_in_range(
    start_time=datetime(2018, 1, 1), end_time=datetime(2018, 1, 5)
)
tecton_df.to_pandas().head(10)