Version: Beta 🚧

Conversion from `PySpark.DataFrame` to `Pandas.DataFrame` with Pandas 2.0

Issue

With Pandas 2.0, when converting Tecton Spark query results (such as get_features_for_events or get_features_in_range) to Pandas, you might encounter an error similar to the following:

TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.

Cause

This error occurs due to a type conversion issue in PySpark when converting TimestampType to datetime64, which is incompatible with Pandas 2.0. The bug has been fixed in PySpark 3.5+, but for older versions, it remains an issue.

For more details, refer to this Spark fix PR.

Resolution

Enable Arrow Conversion

You can enable Arrow conversion to bypass the problematic type conversion code.

Here an example to enable Arrow conversion in the Databricks or EMR SparkSession:

import datetime

# Enable Arrow conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()

Here an example to enable Arrow conversion in a new SparkSession:

import datetime
import pyspark

# Enable Arrow conversion
spark = pyspark.sql.SparkSession.builder.config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()

spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()

Note

The spark.sql.execution.arrow.pyspark.enabled flag is enabled by default on Databricks.
It is not enabled by default on EMR and open-source PySpark.

Limitations

Arrow conversion cannot be enabled when the DataFrame contains [types ineligible for Arrow conversion][https://docs.databricks.com/en/pandas/pyspark-pandas-conversion.html#supported-sql-types], such as MapType, ArrayType of TimestampType, and nested StructType. Arrow conversion will be automatically disabled in such cases.

Among these types, MapType and ArrayType of TimestampType can be handled without Arrow conversion enabled. However, nested StructType with TimestampType cannot be handled without Arrow conversion.

Here is an example of a nested StructType with TimestampType:

from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType

spark.createDataFrame(
    [
        [datetime.datetime.now(), {"inner": {"i": 1}}],
    ],
    StructType(
        [
            StructField("ts", TimestampType()),
            StructField("s", StructType([StructField("inner", StructType([StructField("i", IntegerType())]))])),
        ]
    ),
).toPandas()

For nested StructType with TimestampType, the best solution is to upgrade to Spark 3.5+ where this issue is resolved.

Conversion from PySpark.DataFrame to Pandas.DataFrame with Pandas 2.0

Issue​

Cause​

Resolution​

Enable Arrow Conversion​

Note​

Limitations​

Was this page helpful?