Reputation: 159
Data frame has 4 columns year,month,date,hhmm
hhmm - is hour and minute concatenated eg: 10:30 is equal to 1030
dd=spark.createDataFrame([(2019,2,13,1030),(2018,2,14,1000),(2029,12,13,0300)],["Year","month","date","hhmm"])
dd.collect()
expected output in datetime format in pyspark dataframe dd
dd.collect()
2019-02-13 10:30:00
2018-2-14 10:00:00
2019-12-13 03:00:00
Upvotes: 4
Views: 7351
Reputation: 32700
For Spark 3+, you can use make_timestamp
function :
from pyspark.sql import functions as F
dd = dd.withColumn(
"time",
F.expr("make_timestamp(Year, month, date, substr(hhmm,1,2), substr(hhmm,3,2), 0)")
)
dd.show(truncate=False)
#+----+-----+----+----+-------------------+
#|Year|month|date|hhmm|time |
#+----+-----+----+----+-------------------+
#|2019|2 |13 |1030|2019-02-13 10:30:00|
#|2018|2 |14 |1000|2018-02-14 10:00:00|
#|2029|12 |13 |0300|2029-12-13 03:00:00|
#+----+-----+----+----+-------------------+
Upvotes: 3
Reputation: 3751
There is a problem with your data, 0300 integer will not load as the desired format, for me it loaded as 192, so first you have to load it as string, you just need to assign the data types using schema when doing the load. Refer to documentation. E.g. for a .csv:
from pyspark.sql import DataFrameReader
from pyspark.sql.types import *
schema = StructType([StructField("Year", StringType(), True), StructField("month", StringType(), True), StructField("date", StringType(), True), StructField("hhmm", StringType(), True)])
dd = DataFrameReader.csv(path='your/data/path', schema=schema)
After that you need to fix the data format and convert it to timestamp:
from pyspark.sql import functions as F
dd = spark.createDataFrame([('2019','2','13','1030'),('2018','2','14','1000'),('2029','12','13','300')],["Year","month","date","hhmm"])
dd = (dd.withColumn('month', F.when(F.length(F.col('month')) == 1, F.concat(F.lit('0'), F.col('month'))).otherwise(F.col('month')))
.withColumn('date', F.when(F.length(F.col('date')) == 1, F.concat(F.lit('0'), F.col('date'))).otherwise(F.col('date')))
.withColumn('hhmm', F.when(F.length(F.col('hhmm')) == 1, F.concat(F.lit('000'), F.col('hhmm')))
.when(F.length(F.col('hhmm')) == 2, F.concat(F.lit('00'), F.col('hhmm')))
.when(F.length(F.col('hhmm')) == 3, F.concat(F.lit('0'), F.col('hhmm')))
.otherwise(F.col('hhmm')))
.withColumn('time', F.to_timestamp(F.concat(*dd.columns), format='yyyyMMddHHmm'))
)
dd.show()
+----+-----+----+----+-------------------+
|Year|month|date|hhmm| time|
+----+-----+----+----+-------------------+
|2019| 02| 13|1030|2019-02-13 10:30:00|
|2018| 02| 14|1000|2018-02-14 10:00:00|
|2029| 12| 13|0300|2029-12-13 03:00:00|
+----+-----+----+----+-------------------+
Upvotes: 2