Conversion from PySpark.DataFrame to Pandas.DataFrame with Pandas 2.0
Issue​
With Pandas 2.0, when converting Tecton Spark query results (such as
get_features_for_events or get_features_in_range) to Pandas, you might
encounter an error similar to the following:
TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
Cause​
This error occurs due to a type conversion issue in PySpark when converting
TimestampType to datetime64, which is incompatible with Pandas 2.0. The bug
has been fixed in PySpark 3.5+, but for older versions, it remains an issue.
For more details, refer to this Spark fix PR.
Resolution​
Enable Arrow Conversion​
You can enable Arrow conversion to bypass the problematic type conversion code.
Here an example to enable Arrow conversion in the Databricks or EMR SparkSession:
import datetime
# Enable Arrow conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()
Here an example to enable Arrow conversion in a new SparkSession:
import datetime
import pyspark
# Enable Arrow conversion
spark = pyspark.sql.SparkSession.builder.config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()
spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()
Note​
- The
spark.sql.execution.arrow.pyspark.enabledflag is enabled by default on Databricks. - It is not enabled by default on EMR and open-source PySpark.
Limitations​
Arrow conversion cannot be enabled when the DataFrame contains [types ineligible
for Arrow
conversion][https://docs.databricks.com/en/pandas/pyspark-pandas-conversion.html#supported-sql-types],
such as MapType, ArrayType of TimestampType, and nested StructType.
Arrow conversion will be automatically disabled in such cases.
Among these types, MapType and ArrayType of TimestampType can be handled
without Arrow conversion enabled. However, nested StructType with
TimestampType cannot be handled without Arrow conversion.
Here is an example of a nested StructType with TimestampType:
from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType
spark.createDataFrame(
[
[datetime.datetime.now(), {"inner": {"i": 1}}],
],
StructType(
[
StructField("ts", TimestampType()),
StructField("s", StructType([StructField("inner", StructType([StructField("i", IntegerType())]))])),
]
),
).toPandas()
For nested StructType with TimestampType, the best solution is to upgrade to
Spark 3.5+ where this issue is resolved.