Shibu

Reputation: 1542

How can I split a timestamp column into date and time in Spark?

I want to split timestamp value into date and time.

eg:

1/20/2016 3:20:30 PM
1/20/2016 3:20:31 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:33 PM
1/20/2016 3:20:34 PM
1/20/2016 3:20:34 PM

needs to be split into 1/20/2016 and 3:20:30 PM

Using the SQL split function, I am unable to process it correctly:

split_col = pyspark.sql.functions.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', split_col.getItem(1))

Any help, guys?

Upvotes: 6

Views: 22168

Answers (2)

Rags

Reputation: 1881

As the date and time can come in any format, the right way of doing this is to convert the string to a timestamp and then extract the date and time parts from it.

Let's take the below sample data:

df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])

The Date and Time parts can then be extracted in whatever output format you need:

from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format

# Note: in Spark's (Java SimpleDateFormat) patterns, 'M' is month and 'm' is minute
df.select(unix_timestamp(df.ServerTime, 'M/d/yyyy h:mm:ss a').alias('ut'))\
  .select(from_unixtime('ut').alias('dty'))\
  .select(date_format('dty', 'M/d/yyyy').alias('Date'),
          date_format('dty', 'h:mm:ss a').alias('Time'))\
  .show()

+---------+----------+
|     Date|      Time|
+---------+----------+
|1/20/2016|3:20:30 PM|
|1/20/2016|3:20:31 PM|
|1/20/2016|3:20:32 PM|
+---------+----------+

You can project these two into separate dataframes if you want.
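If it helps to see the same parse-then-reformat idea outside Spark, here is a minimal plain-Python sketch. It uses `datetime.strptime`/`strftime` codes (`%m`, `%I`, `%p`) rather than Spark's Java-style pattern letters, and unlike Spark's `M/d` patterns, `%m`/`%I` zero-pad their output:

```python
from datetime import datetime

# Parse the string once into a real timestamp, then derive both parts from it
ts = datetime.strptime('1/20/2016 3:20:30 PM', '%m/%d/%Y %I:%M:%S %p')
date_part = ts.strftime('%m/%d/%Y')     # '01/20/2016' (zero-padded, unlike Spark's M/d)
time_part = ts.strftime('%I:%M:%S %p')  # '03:20:30 PM'
```

The key point is the same in both languages: parse once into a proper timestamp, then format twice, instead of manipulating the raw string.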

Upvotes: 9

Alex

Reputation: 21766

You could use pyspark.sql.functions.concat to concatenate the relevant time bits together again. Let's first create some test data:

df = sc.parallelize([('1/20/2016 3:20:30 PM',),
                     ('1/20/2016 3:20:31 PM',),
                     ('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])    

You can do this:

import pyspark.sql.functions as F

split_col = F.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
# The AM/PM marker is a third token, so rejoin items 1 and 2
df_time = df.withColumn('Time', F.concat(split_col.getItem(1), F.lit(' '), split_col.getItem(2)))

After running df_time.show(), the following output is returned:

+--------------------+----------+
|          ServerTime|      Time|
+--------------------+----------+
|1/20/2016 3:20:30 PM|3:20:30 PM|
|1/20/2016 3:20:31 PM|3:20:31 PM|
|1/20/2016 3:20:32 PM|3:20:32 PM|
+--------------------+----------+

Running df_date.show() returns:

+--------------------+---------+
|          ServerTime|     Date|
+--------------------+---------+
|1/20/2016 3:20:30 PM|1/20/2016|
|1/20/2016 3:20:31 PM|1/20/2016|
|1/20/2016 3:20:32 PM|1/20/2016|
+--------------------+---------+
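The reason your original attempt lost the AM/PM marker is easy to see in plain Python (no Spark required): splitting on spaces yields three tokens, so item 1 alone is only the clock time:

```python
ts = '1/20/2016 3:20:30 PM'
parts = ts.split(' ')                   # ['1/20/2016', '3:20:30', 'PM']
date_part = parts[0]                    # '1/20/2016'
time_part = parts[1] + ' ' + parts[2]   # '3:20:30 PM', rejoining the AM/PM token
```

This is exactly what the `F.concat(..., F.lit(' '), ...)` call does column-wise in Spark.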

Upvotes: 1
