b4me

Reputation: 31

Error received when converting a pandas DataFrame to a Spark DataFrame

Since Spark has no out-of-the-box support for reading Excel files, I first read the Excel file into a pandas DataFrame and then try to convert that pandas DataFrame into a Spark DataFrame, but I get the errors below (I am using Spark 1.5.1):

import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# create the SQLContext used below
sc = SparkContext()
sqlContext = SQLContext(sc)

pdf = pd.read_excel('/home/testdata/test.xlsx')
df = sqlContext.createDataFrame(pdf)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
    rdd, schema = self._createFromLocal(data, schema)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
    data = [schema.toInternal(row) for row in data]
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
    return self.dataType.toInternal(obj)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
    else time.mktime(dt.timetuple()))
AttributeError: 'datetime.time' object has no attribute 'timetuple'

Does anybody know how to fix it?

Upvotes: 3

Views: 3706

Answers (2)

iretex

Reputation: 69

Explicitly defining the schema fixes the problem. Depending on your use case, you can build the schema dynamically, as shown in the snippet below:

from pyspark.sql.types import *

# Map each pandas dtype to a Spark type; anything unrecognised falls back to StringType.
# Here df is the pandas DataFrame and spark is a SparkSession.
schema = StructType([
    StructField(name,
                TimestampType() if pd.api.types.is_datetime64_dtype(dtype) else
                DateType() if pd.api.types.is_datetime64_any_dtype(dtype) else
                DoubleType() if pd.api.types.is_float_dtype(dtype) else StringType(),
                True)
    for name, dtype in zip(df.columns, df.dtypes)
])

sparkDf = spark.createDataFrame(df, schema)
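Applied to the question's setup, usage might look roughly like this. This is a minimal sketch rather than part of the original answer: the SparkSession (Spark 2.x+), the file path, and the extra step of casting object-dtype columns (such as Excel time cells, which pandas returns as datetime.time objects and which are the likely trigger of the original error) to strings are assumptions.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, TimestampType,
                               DateType, DoubleType, StringType)

spark = SparkSession.builder.getOrCreate()  # Spark 2.x+; the question itself used SQLContext

pdf = pd.read_excel('/home/testdata/test.xlsx')

# Cast object-dtype columns to plain strings so they match the StringType fallback below
for name in pdf.select_dtypes(include='object').columns:
    pdf[name] = pdf[name].astype(str)

# Same dynamic schema as in the answer above
schema = StructType([
    StructField(name,
                TimestampType() if pd.api.types.is_datetime64_dtype(dtype) else
                DateType() if pd.api.types.is_datetime64_any_dtype(dtype) else
                DoubleType() if pd.api.types.is_float_dtype(dtype) else StringType(),
                True)
    for name, dtype in zip(pdf.columns, pdf.dtypes)
])

sparkDf = spark.createDataFrame(pdf, schema)
sparkDf.printSchema()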

Upvotes: 0

Sergey Bushmanov

Reputation: 25189

My best guess is that your problem comes from "incorrectly" parsed datetime data when you read your data with pandas.

The following code "just works":

import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# parse_dates forces pandas to parse these columns as datetime64 values
# instead of leaving them as datetime.time / datetime.date objects
pdf = pd.read_excel('test.xlsx', parse_dates=['Created on', 'Confirmation time'])

sc = SparkContext()
sqlContext = SQLContext(sc)

sqlContext.createDataFrame(data=pdf).collect()

[Row(Customer=1000935702, Country='TW',  ...

Please note that you have one more datetime column, 'Confirmation date', which in your example consists only of NaT and therefore converts without a problem for this short sample; but should your full dataset contain data in that column, you will have to take care of it as well.
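One possible way to handle that column too is simply to parse it together with the others; a minimal sketch, reusing the column names and the sqlContext from above:

import pandas as pd

# Parse all three datetime columns up front so pandas yields datetime64 values
# (NaT for empty cells) instead of datetime.time / datetime.date objects
pdf = pd.read_excel('test.xlsx',
                    parse_dates=['Created on', 'Confirmation time', 'Confirmation date'])

# sqlContext as created in the snippet above
sqlContext.createDataFrame(data=pdf).collect()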

Upvotes: 1
