Chique_Code
Chique_Code

Reputation: 1530

fetch year, month, day from string PySpark

I need to fetch Year, Month, Day, Hour from strings in the column Time in Spark df. I also need to save those values in separate columns. The data looks like this:

ID               Time
111            2020-03-23-12:40:04
112            2020-04-23-12:40:04
113            2020-05-23-12:40:04

The desired output is:

ID        Year        Month        Day
111     2020         03            23
112     2020         04            23
113     2020         05            23

I tried:

data_df.select(
    year("Time").alias('year'), 
    month("Time").alias('month'), 
    dayofmonth("Time").alias('day')
).show()

It returned all Null values.

Upvotes: 1

Views: 613

Answers (1)

Oli
Oli

Reputation: 10406

I assume that the Time column is a string. You can either extract the values you want with regular expressions but that would be quite painful. The other option is to convert the string into a timestamp and then use the functions year, month etc.

Also, your timestamp is not in a standard format, so you need to specify it.

data = [(111, '2020-03-23-12:40:04'),
        (112, '2020-04-23-12:40:04'),
        (113, '2020-05-23-12:40:04')]
df = spark.createDataFrame(data, ['ID', 'Time'])

df\
    .withColumn('t', F.to_timestamp('Time', 'yyyy-MM-dd-HH:mm:ss'))\
    .select('ID',
            F.year('t').alias('year'),
            F.month('t').alias('month'),
            F.dayofmonth('t').alias('day'),
            F.hour('t').alias('hour')
    ).show()

which yields:

+---+----+-----+---+----+
| ID|year|month|day|hour|
+---+----+-----+---+----+
|111|2020|    3| 23|  12|
|112|2020|    4| 23|  12|
|113|2020|    5| 23|  12|
+---+----+-----+---+----+

Upvotes: 2

Related Questions