Reputation: 319
I have pyspark data frame with col1
as original column, and I want to generate new derived_col1
with values as below.
for first two rows the output abbreviation WA and MA.
WA - Web Audio
col1 | derived_col1 |
---|---|
6_23_qc_77563_Amz_WA | Web Audio |
Display Everywhere - sample | Display Everywhere |
Mobile Audio_Sample | Mobile Audio |
Amazon_Web Audio | Web Audio |
Flip_Mobile Audio | Mobile Audio |
Web Audio_Carton | Web Audio |
These example are most common patterns that I found in input data
Here the issue is col1
has different patterns and no thought on how to get this problem solved.
Any ideas?
Upvotes: 0
Views: 67
Reputation: 5487
You can split col1 column by _
or -
then using filter()
higher order column function get the required values. Using when()
and otherwise()
you can replace 'WA' and 'MA' as required.
df.withColumn('derived_col1', expr("filter(split(col1, '(_|-)'), c-> c like 'Display%' or c like 'Mobile%' or c like 'Web%' or c like '%MA' or c like '%WA')")) \
.withColumn('derived_col1', concat_ws('','derived_col1')) \
.withColumn('derived_col1', when(col('derived_col1')=='MA', 'Mobile Audio').when(col('derived_col1')=='WA', 'Web Audio').otherwise(col('derived_col1'))) \
.show(truncate=False)
'''
+----------------------------------+-------------------+
|col1 |derived_col1 |
+----------------------------------+-------------------+
|6_23_KS_77563_S8_V1_KButtrick_MA |Mobile Audio |
|6_23_QFC_77563_S18_V1_KButtrick_WA|Web Audio |
|Display Everywhere - Fry's |Display Everywhere |
|Mobile Audio_Fred Meyer |Mobile Audio |
|Bak_Web Audio |Web Audio |
|Mobile Audio_Dillons |Mobile Audio |
|Smith_Web Audio |Web Audio |
|Smith_Mobile Audio |Mobile Audio |
|Web Audio_Dillons |Web Audio |
|Web Audio_Kroger |Web Audio |
+----------------------------------+-------------------+
'''
Upvotes: 1