Assaf

Reputation: 1124

Converting pyspark DataFrame with date column to Pandas results in AttributeError

I have the following data frame (pyspark) -

 |-- DATE: date (nullable = true)
 |-- ID: string (nullable = true)
 |-- A: double (nullable = true)
 |-- B: double (nullable = true)

While trying to convert the data frame to pandas -

res2 = res.toPandas()

I got the following error - AttributeError: Can only use .dt accessor with datetimelike values

Detailed error -

    AttributeError                            Traceback (most recent call last)
<ipython-input-29-471067d510fa> in <module>
----> 1 res2 = res.toPandas()

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2123                         table = pyarrow.Table.from_batches(batches)
   2124                         pdf = table.to_pandas()
-> 2125                         pdf = _check_dataframe_convert_date(pdf, self.schema)
   2126                         return _check_dataframe_localize_timestamps(pdf, timezone)
   2127                     else:

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_dataframe_convert_date(pdf, schema)
   1705     """
   1706     for field in schema:
-> 1707         pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
   1708     return pdf
   1709 

/opt/anaconda/lib/python3.7/site-packages/pyspark/sql/types.py in _check_series_convert_date(series, data_type)
   1690     """
   1691     if type(data_type) == DateType:
-> 1692         return series.dt.date
   1693     else:
   1694         return series

/opt/anaconda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5061         if (name in self._internal_names_set or name in self._metadata or
   5062                 name in self._accessors):
-> 5063             return object.__getattribute__(self, name)
   5064         else:
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):

/opt/anaconda/lib/python3.7/site-packages/pandas/core/accessor.py in __get__(self, obj, cls)
    169             # we're accessing the attribute of the class, i.e., Dataset.geo
    170             return self._accessor
--> 171         accessor_obj = self._accessor(obj)
    172         # Replace the property with the accessor object. Inspired by:
    173         # http://www.pydanny.com/cached-property.html

/opt/anaconda/lib/python3.7/site-packages/pandas/core/indexes/accessors.py in __new__(cls, data)
    322             pass  # we raise an attribute error anyway
    323 
--> 324         raise AttributeError("Can only use .dt accessor with datetimelike "
    325                              "values")

AttributeError: Can only use .dt accessor with datetimelike values
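
For reference, a minimal way to reproduce this (a sketch, assuming an active SparkSession named spark with Arrow conversion enabled via spark.sql.execution.arrow.enabled, which the pyarrow frames in the traceback suggest; whether it raises depends on the pandas/pyarrow versions installed):

import datetime

from pyspark.sql.types import DateType, StringType, StructField, StructType

schema = StructType([
    StructField("DATE", DateType(), True),
    StructField("ID", StringType(), True),
])
res = spark.createDataFrame([(datetime.date(2017, 2, 15), "x")], schema)
res2 = res.toPandas()  # raises the AttributeError above on affected versions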

Any way to fix it? Maybe convert something in the original DataFrame?

Upvotes: 9

Views: 8347

Answers (2)

pauljohn32

Reputation: 2265

I just tested the recommendation by @cs95, which does work. However, in the output of toPandas, I still see the hour:minute:second part, like

|2017-02-15 00:00:00|

because pandas takes the date value and fills in a time of midnight, even though we don't want one. So the format string passed to to_timestamp does not help here.
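
If you need plain dates on the pandas side, one option (just a sketch, assuming the converted frame is named res2 as in the question) is to strip the time component back off after toPandas:

# back to plain datetime.date objects (object dtype)
res2["DATE"] = res2["DATE"].dt.date
# or keep the datetime64 dtype with the time zeroed out:
# res2["DATE"] = res2["DATE"].dt.normalize()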

I get the same result with somewhat simpler code:

from pyspark.sql.functions import col

col_name = "DATE"
res = res.withColumn(col_name, col(col_name).cast("timestamp"))

Here's why this is more workable. In the use case I face, there are many (many!) columns in the Spark DataFrame, and I need to find all columns of one type and convert them to another. For example, toPandas complains about Spark decimal variables and recommends conversion. I don't want to do that one variable at a time.

Here's a function. The defaults are set for the decimal case, but the approach works for any type conversion.

from pyspark.sql.functions import col

def spark_type_converter(sdf, x="decimal", y="float"):
    """Use Spark's cast to convert all columns of type `x` to type `y`.

    :param sdf: Spark DataFrame
    :param x: string for the type to cast FROM. This is matched against
        `dtypes` with `startswith`, so only the first few characters matter.
    :param y: string for the type to cast INTO.
    """
    target_cols = [name for name, dtype in sdf.dtypes if dtype.startswith(x)]
    for col_name in target_cols:
        sdf = sdf.withColumn(col_name, col(col_name).cast(y))
    return sdf

Used in this case to find all date variables and change them to timestamp:

res = spark_type_converter(res, x="date", y="timestamp")

When toPandas runs, this "one size fits all" approach gives the same result.
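
For the decimal case mentioned above, the defaults make it a one-liner (a sketch; sdf stands for any Spark DataFrame with decimal columns):

sdf = spark_type_converter(sdf)  # defaults: cast every decimal(...) column to float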

Upvotes: 3

cs95

Reputation: 403050

As a workaround, you may consider converting your date column to timestamp (this is more aligned with pandas' datetime type). After the cast, the column is no longer a Spark DateType, so toPandas skips the DateType conversion path that raises in the traceback above.

from pyspark.sql.functions import to_timestamp
res2 = res.withColumn('DATE', to_timestamp(res.DATE, 'yyyy-MM-dd')).toPandas()
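
A quick way to confirm the fix (assuming the names above): after the cast, the column comes back with pandas' datetime64 dtype instead of tripping the .dt accessor:

print(res2['DATE'].dtype)  # datetime64[ns]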

Upvotes: 16
