Conversion from PySpark.DataFrame
to Pandas.DataFrame
with Pandas 2.0
Issue​
With Pandas 2.0, when converting Tecton Spark query results (such as
get_features_for_events
or get_features_in_range
) to Pandas, you might
encounter an error similar to the following:
TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
Cause​
This error occurs due to a type conversion issue in PySpark when converting
TimestampType
to datetime64
, which is incompatible with Pandas 2.0. The bug
has been fixed in PySpark 3.5+, but for older versions, it remains an issue.
For more details, refer to this Spark fix PR.
Resolution​
Enable Arrow Conversion​
You can enable Arrow conversion to bypass the problematic type conversion code.
Here an example to enable Arrow conversion in the Databricks or EMR SparkSession:
import datetime
# Enable Arrow conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()
Here an example to enable Arrow conversion in a new SparkSession:
import datetime
import pyspark
# Enable Arrow conversion
spark = pyspark.sql.SparkSession.builder.config("spark.sql.execution.arrow.pyspark.enabled", "true").getOrCreate()
spark.createDataFrame([[datetime.datetime.now()]], "timestamp: timestamp").toPandas()
Note​
- The
spark.sql.execution.arrow.pyspark.enabled
flag is enabled by default on Databricks. - It is not enabled by default on EMR and open-source PySpark.
Limitations​
Arrow conversion cannot be enabled when the DataFrame contains [types ineligible
for Arrow
conversion][https://docs.databricks.com/en/pandas/pyspark-pandas-conversion.html#supported-sql-types],
such as MapType
, ArrayType
of TimestampType
, and nested StructType
.
Arrow conversion will be automatically disabled in such cases.
Among these types, MapType
and ArrayType
of TimestampType
can be handled
without Arrow conversion enabled. However, nested StructType
with
TimestampType
cannot be handled without Arrow conversion.
Here is an example of a nested StructType
with TimestampType
:
from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType
spark.createDataFrame(
[
[datetime.datetime.now(), {"inner": {"i": 1}}],
],
StructType(
[
StructField("ts", TimestampType()),
StructField("s", StructType([StructField("inner", StructType([StructField("i", IntegerType())]))])),
]
),
).toPandas()
For nested StructType
with TimestampType
, the best solution is to upgrade to
Spark 3.5+ where this issue is resolved.