Jresearcher

Reputation: 347

Convert duration (string column) to seconds PySpark

Can I easily convert a string column containing a time duration to seconds in PySpark?

Is there a function that does this directly? I'd like to avoid multiplying each part of the string out into seconds myself.

Input

id   duration
1    00 00:00:34
2    00 00:04:37
3    120 00:04:37
...  ...

NOTE:

Id 1 -> 0 days, 0 hours, 0 minutes, 34 seconds

Id 2 -> 0 days, 0 hours, 4 minutes, 37 seconds

Id 3 -> 120 days, 0 hours, 4 minutes, 37 seconds

Output

id   duration
1    34
2    277
3    ...
...  ...

Upvotes: 1

Views: 257

Answers (2)

Mahmoud

Reputation: 21

You can do it easily using unix_timestamp, but I recommend a small preprocessing step to strip the days part first (I haven't figured out why it doesn't work with the days included); otherwise it works perfectly with the HH:mm:ss format.

from pyspark.sql.functions import col, unix_timestamp

# With only time fields in the pattern, the date defaults to the epoch
# (1970-01-01), so the parsed value equals the duration in seconds.
df.withColumn('seconds', unix_timestamp(col('duration'), 'HH:mm:ss')).show()

the output will be like this:

+---+--------+-------+
| id|duration|seconds|
+---+--------+-------+
|  1|00:00:34|     34|
|  2|00:04:37|    277|
|  3|00:04:37|    277|
+---+--------+-------+
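The "small processing" for the days part can be sketched in plain Python (a hypothetical helper, not part of this answer): split off the leading day count, convert the HH:mm:ss remainder, and add the days as 86400-second blocks.

```python
# Hypothetical helper: convert a "DD HH:mm:ss" duration string to total seconds.
def duration_to_seconds(duration: str) -> int:
    days, clock = duration.split()                 # e.g. "120", "00:04:37"
    hours, minutes, seconds = (int(p) for p in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds

print(duration_to_seconds("00 00:00:34"))   # 34
print(duration_to_seconds("120 00:04:37"))  # 10368277
```

In Spark this could be wrapped in a `udf`, though built-in column functions like the snippet above are usually faster than Python UDFs.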

Upvotes: 0

Mohana B C

Reputation: 5487

You can extract the day, hour, minute, and second parts from the duration column by applying split, then sum the corresponding seconds to get the desired result.

from pyspark.sql.functions import col, split

df = # input

df.withColumn("duration", split("duration", "\\s+")) \
    .withColumn("time", split(col("duration").getItem(1), ':')) \
    .select(col("id"),
            ((col("duration").getItem(0).cast("int") * 86400) +
             (col("time").getItem(0).cast("int") * 3600) +
             (col("time").getItem(1).cast("int") * 60) +
             (col("time").getItem(2))).cast("long").alias("duration")
            ).show()

+---+--------+
| id|duration|
+---+--------+
|  1|      34|
|  2|     277|
|  3|10368277|
+---+--------+
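The id 3 value checks out by hand: 120 days contribute 120 × 86400 = 10368000 seconds, and 00:04:37 contributes 4 × 60 + 37 = 277 more.

```python
# Sanity-check the id 3 row: 120 days plus 00:04:37.
total = 120 * 86400 + 0 * 3600 + 4 * 60 + 37
print(total)  # 10368277
```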

Upvotes: 1
