Reputation: 1542
I want to split timestamp value into date and time.
eg:
1/20/2016 3:20:30 PM
1/20/2016 3:20:31 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:32 PM
1/20/2016 3:20:33 PM
1/20/2016 3:20:34 PM
1/20/2016 3:20:34 PM
needs to be split into 1/20/2016 and 3:20:30 PM
using sql spilt function I am unable to process it correctly
split_col = pyspark.sql.functions.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', split_col.getItem(1))
Any help guys????
Upvotes: 6
Views: 22168
Reputation: 1881
As the date and time can come in any format, the right way of doing this is to convert the date strings to a Datetype() and them extract Date and Time part from it.
Let take the below sample data
server_times = sc.parallelize([('1/20/2016 3:20:30 PM',),
('1/20/2016 3:20:31 PM',),
('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])
The Date and Time parts can be extracted as follows any format.
from pyspark.sql.functions import unix_timestamp, from_unixtime, date_format
df.select(unix_timestamp(df.ServerTime, 'm/d/yyyy h:m:ss a').alias('ut'))\
.select(from_unixtime('ut').alias('dty'))\
.select(date_format('dty', 'M/d/yyyy').alias('Date'),
date_format('dty', 'h:m:s a').alias('Time'))\
.show()
+---------+----------+
| Date| Time|
+---------+----------+
|1/20/2016|3:20:30 PM|
|1/20/2016|3:20:31 PM|
|1/20/2016|3:20:32 PM|
+---------+----------+
You can project these two into separate dataframes if you want.
Upvotes: 9
Reputation: 21766
You could use pyspark.sql.functions.concat
to concatenate the relevant time bits together again. Let's first create some test data:
df = sc.parallelize([('1/20/2016 3:20:30 PM',),
('1/20/2016 3:20:31 PM',),
('1/20/2016 3:20:32 PM',)]).toDF(['ServerTime'])
You can do this:
import pyspark.sql.functions as F
split_col = pyspark.sql.functions.split(df['ServerTime'], ' ')
df_date = df.withColumn('Date', split_col.getItem(0))
df_time = df.withColumn('Time', F.concat(split_col.getItem(1),F.lit(' '),split_col.getItem(2)))
After running df_time.show()
, the following output is returned:
+--------------------+----------+
| ServerTime| Time|
+--------------------+----------+
|1/20/2016 3:20:30 PM|3:20:30 PM|
|1/20/2016 3:20:31 PM|3:20:31 PM|
|1/20/2016 3:20:32 PM|3:20:32 PM|
+--------------------+----------+
Running df_date.show()
returns:
+--------------------+---------+
| ServerTime| Date|
+--------------------+---------+
|1/20/2016 3:20:30 PM|1/20/2016|
|1/20/2016 3:20:31 PM|1/20/2016|
|1/20/2016 3:20:32 PM|1/20/2016|
+--------------------+---------+
Upvotes: 1