user288609
user288609

Reputation: 13035

regarding the usage of cast function to process time information in pyspark

There is a dataframe shown as follows, it has two columns.

df.show()
|                Time|             MinTime|
|2019-11-19 23:00:...|2019-11-19 23:00:...|
|2019-11-19 23:15:...|2019-11-19 23:00:...|
|2019-11-19 23:30:...|2019-11-19 23:00:...|

root
 |-- Time: string (nullable = true)
 |-- MinTime: string (nullable = true)

df.show(truncate=False)
|                Time|             MinTime|
|2019-11-19 23:00:000000|2019-11-19 23:00:000000|
|2019-11-19 23:15:000000|2019-11-19 23:00:000000|
|2019-11-19 23:30:000000|2019-11-19 23:00:000000|

After I use the following line of code to process the above column, the values for column Offset are all null. Based on the values in Time and MinTime, the difference should not be null for all the rows. May I know the reason for this?

df= df.withColumn('Offset',((col('Time').cast('long') - col('MinTime').cast('long'))))
df.show()
|                Time|             MinTime|         Offset|
|2019-11-19 23:00:...|2019-11-19 23:00:...|           null|
|2019-11-19 23:15:...|2019-11-19 23:00:...|           null|
|2019-11-19 23:30:...|2019-11-19 23:00:...|           null|

df.printSchema()
root
 |-- Time: string (nullable = true)
 |-- MinTime: string (nullable = true)
 |-- Offset: long (nullable = true)

df.show(truncate=False)
|                Time|                  MinTime|          Offset|
|2019-11-19 23:00:000000|2019-11-19 23:00:000000|           null|
|2019-11-19 23:15:000000|2019-11-19 23:00:000000|           null|
|2019-11-19 23:30:000000|2019-11-19 23:00:000000|           null|

Upvotes: 0

Views: 257

Answers (1)

koiralo
koiralo

Reputation: 23109

Please check your schema of your df, if the columns type is String it has to be converted to timestamp first

You can use to_timestamp function to convert the datatype to timestamp first as

date_format = 'yyyy-MM-dd HH:mm:ss'

df.withColumn('Offset',
              (f.to_timestamp('Time', date_format).cast('long') - f.to_timestamp('MinTime').cast('long'))) \
    .show(truncate=False)

Result:

+-------------------+-------------------+------+
|Time               |MinTime            |Offset|
+-------------------+-------------------+------+
|2019-11-19 23:00:00|2019-11-19 23:00:00|0     |
|2019-11-19 23:15:00|2019-11-19 23:00:00|900   |
|2019-11-19 23:30:00|2019-11-19 23:00:00|1800  |
+-------------------+-------------------+------+

Please make sure to use the correct date format.

Upvotes: 1

Related Questions