How can I generate a new column from existing column in pyspark using pattern matching

Question

I have pyspark data frame with col1 as original column, and I want to generate new derived_col1 with values as below.

for first two rows the output abbreviation WA and MA.

WA - Web Audio

col1	derived_col1
6_23_qc_77563_Amz_WA	Web Audio
Display Everywhere - sample	Display Everywhere
Mobile Audio_Sample	Mobile Audio
Amazon_Web Audio	Web Audio
Flip_Mobile Audio	Mobile Audio
Web Audio_Carton	Web Audio

These example are most common patterns that I found in input data

Here the issue is col1 has different patterns and no thought on how to get this problem solved.

Any ideas?

Mohana B C · Accepted Answer

You can split col1 column by _ or - then using filter() higher order column function get the required values. Using when() and otherwise() you can replace 'WA' and 'MA' as required.

df.withColumn('derived_col1', expr("filter(split(col1, '(_|-)'), c-> c like 'Display%' or c like 'Mobile%' or c like 'Web%' or c like '%MA' or c like '%WA')")) \
.withColumn('derived_col1', concat_ws('','derived_col1')) \
.withColumn('derived_col1', when(col('derived_col1')=='MA', 'Mobile Audio').when(col('derived_col1')=='WA', 'Web Audio').otherwise(col('derived_col1'))) \
.show(truncate=False)

'''
+----------------------------------+-------------------+
|col1                              |derived_col1       |
+----------------------------------+-------------------+
|6_23_KS_77563_S8_V1_KButtrick_MA  |Mobile Audio       |
|6_23_QFC_77563_S18_V1_KButtrick_WA|Web Audio          |
|Display Everywhere - Fry's        |Display Everywhere |
|Mobile Audio_Fred Meyer           |Mobile Audio       |
|Bak_Web Audio                     |Web Audio          |
|Mobile Audio_Dillons              |Mobile Audio       |
|Smith_Web Audio                   |Web Audio          |
|Smith_Mobile Audio                |Mobile Audio       |
|Web Audio_Dillons                 |Web Audio          |
|Web Audio_Kroger                  |Web Audio          |
+----------------------------------+-------------------+
'''

How can I generate a new column from existing column in pyspark using pattern matching

Answers (1)

Related Questions