John Stud
John Stud

Reputation: 1779

PySpark: Extract Hour and Minutes from String

I am looking for help as to how I can extract the hour and minutes, separately, from a string in PySpark:

df = spark.createDataFrame([['1325'], ['1433'], ['730']], ['time'])
df = df.withColumn("time", to_timestamp("time"))  # cast timestamp
display(df)

# example timestamp results
  time
1 1325-01-01T00:00:00.000-0500
2 1433-01-01T00:00:00.000-0500
3 null

I am not exactly sure what to do, casting it to unixtime, date, and timestamp are all not very cooperative with this type of string data.

Ideally I would like it to return:

  time  hour  minutes
1 1325   13     25
2 1433   14     33
3 730    7      30

Upvotes: 0

Views: 470

Answers (1)

jxc
jxc

Reputation: 13998

IIUC, one way you can try is to split the string with pattern (?=\d\d$), and then take hour/minutes from the resulting array:

from pyspark.sql import functions as F

df.withColumn('hm', F.split(F.lpad('time',4,'0'), '(?=\d\d$)')) \
    .selectExpr('time', 'int(hm[0]) as hour', 'int(hm[1]) as minutes') \
    .show()
+----+----+-------+
|time|hour|minutes|
+----+----+-------+
|1325|  13|     25|
|1433|  14|     33|
| 730|   7|     30|
|   2|   0|      2|
+----+----+-------+

Upvotes: 2

Related Questions