Reputation: 1779
I am looking for help as to how I can extract the hour and minutes, separately, from a string in PySpark:
df = spark.createDataFrame([['1325'], ['1433'], ['730']], ['time'])
df = df.withColumn("time", to_timestamp("time")) # cast timestamp
display(df)
# example timestamp results
time
1 1325-01-01T00:00:00.000-0500
2 1433-01-01T00:00:00.000-0500
3 null
I am not exactly sure what to do, casting it to unixtime, date, and timestamp are all not very cooperative with this type of string data.
Ideally I would like it to return:
time hour minutes
1 1325 13 25
2 1433 14 33
3 730 7 30
Upvotes: 0
Views: 470
Reputation: 13998
IIUC, one way you can try is to split the string with pattern (?=\d\d$)
, and then take hour/minutes from the resulting array:
from pyspark.sql import functions as F
df.withColumn('hm', F.split(F.lpad('time',4,'0'), '(?=\d\d$)')) \
.selectExpr('time', 'int(hm[0]) as hour', 'int(hm[1]) as minutes') \
.show()
+----+----+-------+
|time|hour|minutes|
+----+----+-------+
|1325| 13| 25|
|1433| 14| 33|
| 730| 7| 30|
| 2| 0| 2|
+----+----+-------+
Upvotes: 2