Reputation: 11
I have a python pandas dataframe (pd_df) as follows:
time count
0 2015-01-31 835
1 2015-02-28 1693
2 2015-03-31 2439
which I want to convert to a Spark dataframe (sp_df). When I tried
sp_df = sqlContext.createDataFrame(pd_df)
The first column was returned in bigint format.
time count
1422662400000000000 835
1425081600000000000 1693
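(Those bigints are the pandas datetime64[ns] values, i.e. nanoseconds since the Unix epoch; a quick check, assuming pandas is imported as pd:)
import pandas as pd
# 2015-01-31 00:00:00 UTC expressed in nanoseconds since 1970-01-01
print(pd.Timestamp('2015-01-31').value)  # 1422662400000000000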
I also tried the following schema, but it didn't work either:
from pyspark.sql.types import *
schema = StructType([
    StructField("time", DateType(), True),
    StructField("count", IntegerType(), True)])
sp_df = sqlContext.createDataFrame(pd_df, schema)
It gave me the error:
DateType can not accept object 1422662400000000000L in type <type 'long'>
Can anyone suggest the right way to do this?
Upvotes: 1
Views: 4954
Reputation: 21
Adding this for anyone who had the problem of converting a pandas date column to a Spark DateType rather than a TimestampType. My column, although it was a proper dt.date type column in the pandas dataframe, was automatically converted to a Spark TimestampType (which includes the time 00:00:00). That was undesirable.
Eventually the solution was changing the way of creating the pandas dataframe from:
df = pd.DataFrame()
df['date'] = pd.to_datetime(['2019-01-01', '2019-02-02']).date
to doing the same thing, but creating the pandas DataFrame from a dictionary:
d = {'date': pd.to_datetime(['2019-01-01', '2019-02-02']).date}
df = pd.DataFrame(data=d)
Creating the dataframe from a dictionary fixed the problem: the converted Spark dataframe now had a date column and not a timestamp column.
It's worth adding that I also tried to manually convert from pandas to Spark by adding the mapping np.dtype('<M8[ns]'): DateType().
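A minimal end-to-end sketch of this approach, assuming a SparkSession named spark (the question uses the older sqlContext, which works the same way here):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Build the pandas dataframe from a dict of datetime.date objects
d = {'date': pd.to_datetime(['2019-01-01', '2019-02-02']).date}
df = pd.DataFrame(data=d)
sp_df = spark.createDataFrame(df)
sp_df.printSchema()  # 'date' is inferred as DateType, not TimestampType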
Upvotes: 2
Reputation: 104
I had the same problem. The fastest way is to treat the bigint as a Unix timestamp in nanoseconds and convert it with a Spark API function (to a timestamp or a date):
from_unixtime(df['time']/1000000000, format='yyyy-MM-dd HH:mm:ss').cast('timestamp')
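In context, a usage sketch, assuming sp_df is the dataframe produced by createDataFrame in the question:
from pyspark.sql.functions import from_unixtime

# Nanoseconds -> seconds, then parse into a proper timestamp column;
# use .cast('date') instead for a plain DateType column
sp_df = sp_df.withColumn(
    'time', from_unixtime(sp_df['time'] / 1000000000).cast('timestamp'))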
Upvotes: 0
Reputation: 528
What I ended up doing in the same situation was to use apply to convert the date column (which for me was an int of year, month, and day, i.e. 20150129) to a date. As a reminder, apply runs for each row of the table and returns the result.
import datetime

# Parse the yyyymmdd integer into a datetime.date for each row
pos_data['TRANSACTION_DATE_converted'] = pos_data.TRANSACTION_DATE.apply(
    lambda x: datetime.datetime.strptime(str(x), '%Y%m%d').date())
Then I used createDataFrame, the same as you, and it read in the date. It shows up as
TRANSACTION_DATE_converted=datetime.date(2016, 9, 6)
in the rows, but this has not yet caused me any issues.
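For completeness, a short sketch of that step, assuming sqlContext from the question and the pos_data frame above:
sp_df = sqlContext.createDataFrame(pos_data)
sp_df.printSchema()  # TRANSACTION_DATE_converted comes through as DateType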
Hive stores the value as a dashed string (i.e., 2014-01-29
), so I'll convert the date into a string if it gives me further grief. Going through two datatypes is a bit silly, though.
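If that fallback ever becomes necessary, a one-line sketch (same assumptions as above):
# Produce the dashed string form directly instead of a datetime.date
pos_data['TRANSACTION_DATE_converted'] = pos_data.TRANSACTION_DATE.apply(
    lambda x: datetime.datetime.strptime(str(x), '%Y%m%d').strftime('%Y-%m-%d'))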
Upvotes: 0