aditya

Reputation: 11

pandas dataframe(DatetimeIndex column) to spark dataframe (datetime format)

I have a python pandas dataframe (pd_df) as follows:

    time              count
0  2015-01-31           835
1  2015-02-28          1693
2  2015-03-31          2439

which I want to convert to a Spark dataframe (sp_df). When I tried

sp_df = sqlContext.createDataFrame(pd_df)

The first column was returned in bigint format.

time                count
1422662400000000000 835
1425081600000000000 1693

I also tried the schema as follows but it didn't work either:

from pyspark.sql.types import *
schema = StructType([
  StructField("time", StringType(), True),
  StructField("count", IntegerType(), True)])
sp_df = sqlContext.createDataFrame(pd_df, schema)

It gave me the error:

DateType can not accept object 1422662400000000000L in type <type 'long'>

Can anyone suggest the right way to do it?
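(For reference, the setup can be reproduced roughly like this; the exact construction of pd_df isn't shown above, so this is just one way to get a frame with a datetime64[ns] time column like the one shown.)

import pandas as pd

# hypothetical reconstruction of pd_df with a datetime64[ns] 'time' column
pd_df = pd.DataFrame({
    'time': pd.to_datetime(['2015-01-31', '2015-02-28', '2015-03-31']),
    'count': [835, 1693, 2439],
})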

Upvotes: 1

Views: 4954

Answers (3)

pentatomic

Reputation: 21

Adding this here for anyone who had the problem of converting a pandas date column to a Spark DateType rather than a TimestampType. My df column, although it was a proper dt.date column in the pandas dataframe, was automatically converted to a Spark timestamp (which includes the hour 00:00:00). That was undesirable.

Eventually the solution was changing the way of creating the Pandas dataframe from:

df = pd.DataFrame()
df['date'] = pd.to_datetime(pd.Series(['2019-01-01', '2019-02-02'])).dt.date

to doing the same thing, but creating the pandas DataFrame using a dictionary

d = {'date': pd.to_datetime(pd.Series(['2019-01-01', '2019-02-02'])).dt.date}
df = pd.DataFrame(data=d)

Creating the dataframe from a dictionary fixed the problem, and the converted Spark dataframe now has a date column and not a timestamp column.
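Putting it together, a minimal sketch of the round trip, assuming a SparkSession named spark is available (the question's sqlContext.createDataFrame works the same way):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# datetime.date objects in the pandas column are inferred as DateType by Spark
d = {'date': pd.to_datetime(pd.Series(['2019-01-01', '2019-02-02'])).dt.date}
df = pd.DataFrame(data=d)

sdf = spark.createDataFrame(df)
sdf.printSchema()  # 'date' comes through as date, not timestamp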

It's worth adding that I had also tried to convert manually from pandas to Spark by adding the mapping np.dtype('<M8[ns]'): DateType().

Upvotes: 2

Mishter_Jingles

Reputation: 104

I had the same problem. The fastest way is to treat the bigint as a Unix timestamp in nanoseconds and then convert it via a Spark API function to a timestamp (or date):

from pyspark.sql.functions import from_unixtime

from_unixtime(df['time'] / 1000000000, format='yyyy-MM-dd HH:mm:ss').cast('timestamp')
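Applied to the dataframe from the question, that might look roughly like this; a sketch only, assuming sp_df holds the nanosecond bigint column shown above, and using the standard to_date function for the date variant:

from pyspark.sql.functions import from_unixtime, to_date

# nanosecond epoch -> seconds -> timestamp
sp_df = sp_df.withColumn('time', from_unixtime(sp_df['time'] / 1000000000).cast('timestamp'))
# or, for a plain date instead of a timestamp:
# sp_df = sp_df.withColumn('time', to_date(from_unixtime(sp_df['time'] / 1000000000)))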

Upvotes: 0

Rob Guderian
Rob Guderian

Reputation: 528

What I ended up doing in the same situation was to use apply to convert the date column (which for me was an int of year, month, day, i.e. 20150129) to a Python date. Reminder: apply runs for each row of the column and returns the result.

import datetime

# parse ints like 20150129 into datetime.date objects
pos_data['TRANSACTION_DATE_converted'] = pos_data.TRANSACTION_DATE.apply(lambda x: datetime.datetime.strptime(str(x), '%Y%m%d').date())

Then, I used createDataFrame, the same as you, and it read in the date. It shows up as

TRANSACTION_DATE_converted=datetime.date(2016, 9, 6)

in the schema, but this has not yet caused me any issues.
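In code, that step might look roughly like this (a sketch, assuming the same sqlContext as in the question):

sp_df = sqlContext.createDataFrame(pos_data)
sp_df.printSchema()  # TRANSACTION_DATE_converted shows up as date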

Hive stores the value as a dashed string (i.e. 2014-01-29), so I'll convert the date into a string if it gives me further grief. Though going through two datatypes is a bit of a silly thing.

Upvotes: 0
