Reddspark

Reputation: 7577

How to convert from Pandas' DatetimeIndex to DataFrame in PySpark?

I have the following code:

# Get the min and max dates
minDate, maxDate = df2.select(f.min("MonthlyTransactionDate"), f.max("MonthlyTransactionDate")).first()
d = pd.date_range(start=minDate, end=maxDate, freq='MS')    

tmp = pd.Series(d)
df3 = spark.createDataFrame(tmp)

I have checked tmp and I have a pandas Series of dates. I then check df3, but it looks like it's just empty:

++ 
|| 
++ 
|| 
|| 
|| 
|| 
|| 
|| 
|| 
||

What's happening?

Upvotes: 3

Views: 1566

Answers (3)

Neeraj Bhadani

Reputation: 3110

In your case d is a DatetimeIndex. What you can do is create a pandas DataFrame from the DatetimeIndex and then convert the pandas DataFrame to a Spark DataFrame. Sample code below.

1. Create a DatetimeIndex

import pandas as pd
d = pd.date_range('2018-12-01', '2019-01-02', freq='MS')

2. Create a pandas DataFrame.

p_df = pd.DataFrame(d)

3. Create a Spark DataFrame.

spark.createDataFrame(p_df).show()
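The pandas side of steps 1 and 2 can be checked on its own before handing the DataFrame to Spark; `pd.DataFrame(d)` yields a single column (with the default label `0`) of timestamps. A minimal pandas-only sketch:

```python
import pandas as pd

# Step 1: build a monthly ("MS" = month start) DatetimeIndex
d = pd.date_range('2018-12-01', '2019-01-02', freq='MS')

# Step 2: wrap the index in a DataFrame; the single column
# gets the default integer label 0 and dtype datetime64[ns]
p_df = pd.DataFrame(d)
print(p_df)
```

Once `p_df` looks right, `spark.createDataFrame(p_df)` in step 3 infers a proper timestamp column from it.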

Upvotes: 6

akuiper

Reputation: 215117

d is a DatetimeIndex here, not a pandas DataFrame. You need to convert it to a DataFrame first, which can be done with the to_frame method:

d = pd.date_range('2018-10-10', '2018-12-15', freq='MS')
spark.createDataFrame(d).show()
++
||
++
||
||
++

spark.createDataFrame(d.to_frame()).show()
+-------------------+
|                  0|
+-------------------+
|2018-11-01 00:00:00|
|2018-12-01 00:00:00|
+-------------------+
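As a side note, `to_frame` also accepts a `name` argument, so the column can be given a readable label instead of the default `0` (the name `MonthlyTransactionDate` below just mirrors the column in the question and is an illustrative choice):

```python
import pandas as pd

# Same monthly index as above
d = pd.date_range('2018-10-10', '2018-12-15', freq='MS')

# name= labels the single column; index=False drops the
# redundant copy of the dates that to_frame keeps as the index
p_df = d.to_frame(index=False, name='MonthlyTransactionDate')
print(p_df)
```

Passing this to `spark.createDataFrame` then gives the Spark column that name directly, instead of `0`.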

Upvotes: 3
