Reputation: 1124
I have the following data frame (pyspark) -
|-- DATE: date (nullable = true)
|-- ID: string (nullable = true)
|-- A: double (nullable = true)
|-- B: double (nullable = true)
While trying to convert the data frame to pandas -
res2 = res.toPandas()
I got the following error - AttributeError: Can only use .dt accessor with datetimelike values
Detailed error -
AttributeError Traceback (most recent call last)
<ipython-input-29-471067d510fa> in <module>
----> 1 res2 = res.toPandas()
/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/dataframe.py in toPandas(self)
2123 table = pyarrow.Table.from_batches(batches)
2124 pdf = table.to_pandas()
-> 2125 pdf = _check_dataframe_convert_date(pdf, self.schema)
2126 return _check_dataframe_localize_timestamps(pdf, timezone)
2127 else:
/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_dataframe_convert_date(pdf, schema)
1705 """
1706 for field in schema:
-> 1707 pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
1708 return pdf
1709
/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_series_convert_date(series, data_type)
1690 """
1691 if type(data_type) == DateType:
-> 1692 return series.dt.date
1693 else:
1694 return series
/opt/anaconda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5061 if (name in self._internal_names_set or name in self._metadata or
5062 name in self._accessors):
-> 5063 return object.__getattribute__(self, name)
5064 else:
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
/opt/anaconda/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
169 # we're accessing the attribute of the class, i.e., Dataset.geo
170 return self._accessor
--> 171 accessor_obj = self._accessor(obj)
172 # Replace the property with the accessor object. Inspired by:
173 # http://www.pydanny.com/cached-property.html
/opt/anaconda/lib/python3.7/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
322 pass # we raise an attribute error anyway
323
--> 324 raise AttributeError("Can only use .dt accessor with datetimelike "
325 "values")
AttributeError: Can only use .dt accessor with datetimelike values
Any way to fix it? Maybe convert something in the original data frame?
Upvotes: 9
Views: 8347
Reputation: 2265
I just tested the recommendation by @cs95, and it does work. However, in the output of toPandas I still see an hour:minute:second component, like
|2017-02-15 00:00:00|
because pandas takes the date value and fills in a time, even though we don't want one. Hence the format string in to_timestamp does not help here.
I get the same result with somewhat simpler code:
from pyspark.sql.functions import col
col_name = "DATE"
res = res.withColumn(col_name, col(col_name).cast("timestamp"))
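After the cast, toPandas no longer hits the .dt accessor error; a minimal follow-up, assuming the frame is still named res:
res2 = res.toPandas()  # DATE comes back as datetime64[ns] (with a midnight time component)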
Here's why this is more workable. In the use case I face, there are many (many!) columns in the Spark DataFrame, and I need to find all columns of one type and convert them to another. For example, toPandas complains about Spark Decimal columns and recommends conversion, and I don't want to run that one column at a time.
Here's a function. The defaults are set for the decimal case, but this approach works for any type conversion.
from pyspark.sql.functions import col

def spark_type_converter(sdf, x="decimal", y="float"):
    """This uses Spark cast to convert variables of type `x` to `y`.

    :param sdf: Spark data frame
    :param x: character string for the input type to cast FROM.
        This is matched against `dtype` with `startswith`,
        so only the first few characters matter.
    :param y: character string for the name of the type to cast INTO.
    """
    target_cols = [i for i, j in sdf.dtypes if j.startswith(x)]
    for col_name in target_cols:
        sdf = sdf.withColumn(col_name, col(col_name).cast(y))
    return sdf
Use it in this case to find all date columns and change them to timestamp:
res = spark_type_converter(res, x="date", y="timestamp")
When toPandas runs, this "one size fits all" approach gives the same result.
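For the Decimal columns mentioned above, the defaults already cover that case; a minimal sketch using the same helper:
res = spark_type_converter(res)  # decimal -> float via the function defaults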
Upvotes: 3
Reputation: 403050
As a workaround, you may consider converting your date column to timestamp (this is more aligned with pandas' datetime type).
from pyspark.sql.functions import to_timestamp
res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas()
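If you then need plain date objects rather than midnight timestamps on the pandas side, a minimal follow-up (using the res2 from above, assuming the column keeps the name DATE) could be:
res2['DATE'] = res2['DATE'].dt.date  # drop the filled-in 00:00:00, leaving datetime.date objects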
Upvotes: 16