hellotherebj

Reputation: 121

How to change column value ( pattern ) in pyspark

Let's say I have a dataframe that looks like this:

    from pyspark.sql.types import StructType, StructField, StringType

    TEST_schema = StructType([StructField("Quarter", StringType(), True)])
    TEST_data = [('2020_1',), ('2020_1',), ('2020_2',), ('2020_3',), ('2020_4',)]

    TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
    TEST_df.show()
    
    +-------+
    |Quarter|
    +-------+
    | 2020_1|
    | 2020_1|
    | 2020_2|
    | 2020_3|
    | 2020_4|
    +-------+

What I want to do is change 2020_1 to Q1 2020, 2020_2 to Q2 2020, 2020_3 to Q3 2020, and 2020_4 to Q4 2020.

So generally, yyyy_# -> Q# yyyy, where yyyy is the year.

Upvotes: 0

Views: 62

Answers (1)

Lamanus

Reputation: 13591

You cannot use the to_timestamp function to parse a quarter in this format, so you should use a regular expression instead.

    from pyspark.sql.functions import concat, lit, regexp_extract

    df = spark.createDataFrame(['2020_1'], 'string').toDF('val')

    df.withColumn('new', concat(lit('Q'),
                                regexp_extract('val', '[0-9]{1}$', 0),
                                lit(' '),
                                regexp_extract('val', '^[0-9]{4}', 0))).show()

+------+-------+
|   val|    new|
+------+-------+
|2020_1|Q1 2020|
+------+-------+
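The capture-group logic behind this can also be expressed as a single substitution rather than two extracts plus a concat. A minimal pure-Python sketch of that pattern (using the standard `re` module, so it runs without a Spark session; Spark's `regexp_replace` applies the same idea, but with Java-style `$1`/`$2` backreferences instead of Python's `\1`/`\2`):

```python
import re

def quarter_label(val: str) -> str:
    # Capture the 4-digit year and the trailing quarter digit,
    # then emit them in "Q<quarter> <year>" order.
    return re.sub(r'^([0-9]{4})_([0-9])$', r'Q\2 \1', val)

print(quarter_label('2020_1'))  # Q1 2020
print(quarter_label('2020_4'))  # Q4 2020
```

The Spark equivalent of this single-call approach would be `regexp_replace('val', '^([0-9]{4})_([0-9])$', 'Q$2 $1')`.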

Upvotes: 2
