Reputation: 934
I have a Pandas Series object
dates = pd.Series(pd.date_range(start_date, end_date)) \
    .dt.strftime('%y%m%d') \
    .astype(int)
And I would like to create a Spark DataFrame directly from the Series object, without an intermediate Pandas DataFrame:
_schema = StructType([
StructField("date_id", IntegerType(), True),
])
dates_rdd = sc.parallelize(dates)
self.date_table = spark.createDataFrame(dates_rdd, _schema)
Error:
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 160101 in type <class 'numpy.int64'>
If I change the Series object as follows:
dates = pd.Series(pd.date_range(start_date, end_date)) \
    .dt.strftime('%y%m%d') \
    .astype(int).values.tolist()
the error becomes:
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 160101 in type <class 'int'>
How can I properly map the int values contained in the dates list/RDD to native Python rows that Spark DataFrames accept?
Upvotes: 3
Views: 7924
Reputation: 2696
This will work:
dates_rdd = sc.parallelize(dates).map(lambda x: tuple([int(x)]))
date_table = spark.createDataFrame(dates_rdd, _schema)
The purpose of the additional map when defining dates_rdd is to make the format of the RDD match the schema: each element becomes a one-element tuple holding a native Python int, which is what a one-field StructType expects.
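The Spark-independent part of this fix can be seen with pandas alone: the Series holds numpy.int64 scalars, and each RDD element must be a row-like container (tuple, list, or Row) rather than a bare scalar. A minimal sketch, assuming a small fixed date range in place of start_date/end_date:

```python
import pandas as pd

# Build the same Series of yymmdd integers; its dtype is int64 (numpy).
dates = pd.Series(pd.date_range("2016-01-01", "2016-01-03")) \
    .dt.strftime('%y%m%d') \
    .astype(int)

# Bare numpy.int64 scalars are what StructType rejects.
print(type(dates.iloc[0]))  # <class 'numpy.int64'>

# Convert to one-element tuples of native Python ints — the shape
# that sc.parallelize(...) / createDataFrame(...) expects per row.
rows = [(int(x),) for x in dates]
print(rows)  # [(160101,), (160102,), (160103,)]
```

Passing `rows` to `sc.parallelize(...)` gives the same result as the `.map(lambda x: tuple([int(x)]))` step above.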
Upvotes: 3
Reputation: 5880
I believe you missed creating a tuple for each Series value:
>>> dates = pd.Series(pd.date_range(start='1/1/1980', end='1/11/1980')).dt.strftime('%y%m%d').astype(int).values.tolist()
>>> rdd = sc.parallelize(dates).map(lambda x:(x,))
>>> _schema = StructType([StructField("date_id", IntegerType(), True),])
>>> df = spark.createDataFrame(rdd,schema=_schema)
>>> df.show()
+-------+
|date_id|
+-------+
| 800101|
| 800102|
| 800103|
| 800104|
| 800105|
| 800106|
| 800107|
| 800108|
| 800109|
| 800110|
| 800111|
+-------+
>>> df.printSchema()
root
|-- date_id: integer (nullable = true)
Upvotes: 2