Materialization Jobs are Failing due to Spot Instance Availability
Issue
When a materialization job fails due to spot instance availability, you may see an error similar to the one below when inspecting your Tecton materialization jobs.
You can quickly navigate to failing clusters by logging into your web UI, clicking the “Jobs” tab on the left-hand panel, then searching for jobs that are either retrying or have failed.
Scope
Applies to Databricks and EMR customers.
Cause
This happens because, by default, Tecton uses spot EC2 instances to run materialization jobs. Spot instances tend to be much cheaper than on-demand EC2 instances, but they may not be available in certain regions for certain instance types at certain times.
This is normally not a problem since Tecton will automatically retry materialization jobs if they “soft” fail for recoverable reasons such as spot instance availability. Tecton uses an exponential backoff delay before retrying soft-failed jobs.
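As a rough illustration of what retrying with an exponential backoff delay means, here is a minimal sketch in Python. It is not Tecton's implementation: the SoftFailure class, the function names, and the base delay, growth factor, and cap are all hypothetical values chosen for the example.

```python
import time


class SoftFailure(Exception):
    """Hypothetical marker for a recoverable failure, such as no spot capacity."""


def backoff_delays(base_seconds=60.0, factor=2.0, max_seconds=3600.0, retries=5):
    """Delay before each retry: base, base*factor, ... capped at max_seconds."""
    return [min(base_seconds * factor**i, max_seconds) for i in range(retries)]


def run_with_retries(job, delays, sleep=time.sleep):
    """Run job(); on SoftFailure, wait for the next delay and try again.

    Any other exception (a "hard" failure) propagates immediately.
    Once every delay has been used, one final attempt is made and a
    SoftFailure from it propagates to the caller.
    """
    for delay in delays:
        try:
            return job()
        except SoftFailure:
            sleep(delay)
    return job()
```

With these assumed defaults, backoff_delays(retries=4) yields waits of 60, 120, 240, and 480 seconds, so each failed attempt roughly doubles the time before the next one. This is why a job that keeps hitting unavailable spot capacity can appear to be stuck retrying for a long time.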
That said, sometimes an instance type isn’t available for several hours or up to a day, due to heavy utilization. We have noted that this problem is particularly acute for the us-east-1 region, and for the m5.xlarge and m5.2xlarge instance types, which tend to be very popular.
Resolution
To resolve this issue:
- Wait until the instance type becomes available. Since spot instance availability is considered a “soft” failure, Tecton will retry materialization jobs that fail for this reason.
- Change the cluster instance type. You can provide an EMRClusterConfig or DatabricksClusterConfig option block to your feature views to change whether to fall back to on-demand instances if spot instances aren’t available, the instance type, the number of nodes, and other parameters.
If a feature view is particularly important or sensitive to delays in processing, we suggest examining the first_on_demand and instance_availability options, and changing the instance type if in us-east-1.
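For illustration, a cluster option block along these lines might look like the sketch below. It assumes the Databricks variant and the Tecton Python SDK. The option names first_on_demand and instance_availability come from the text above; the "spot_with_fallback" value, the number_of_workers parameter, and the specific instance type and counts are assumptions made for the example, so check them against the SDK reference for your Tecton version.

```python
from tecton import DatabricksClusterConfig

# Illustrative values only, not recommendations. Verify parameter names and
# accepted values against the SDK reference for your Tecton version.
materialization_cluster = DatabricksClusterConfig(
    instance_type="m5.4xlarge",                  # hypothetical choice of a different instance type
    instance_availability="spot_with_fallback",  # assumed value: fall back to on-demand if no spot capacity
    number_of_workers=4,                         # assumed parameter for the number of worker nodes
    first_on_demand=1,                           # assumed meaning: first node(s) run on on-demand instances
)
```

The resulting object would then be passed to the feature view definition as its batch compute configuration (the exact argument name depends on the SDK version). EMRClusterConfig may not accept the same set of options, so confirm which ones apply before copying this to an EMR deployment.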