Chique_Code
Chique_Code

Reputation: 1530

string extract with pySpark

I have a column targeting in spark df. The values look like this:

ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking
ab=px_d_1200;ab=8;ab=t_d_o_1000;apn=640x480_370;artid=delish.recipe.25860457;artid=delish_recipe_25860457;avb=90;cat=recipes;clc=chicken-breast-recipes;clc=insanely-easy-chicken-dinners;clc=weeknight-dinners;embedid=a6311e94-3b66-4712-8fca-eaa423e4e69a;gs_cat=response_check;gs_cat=gl_english;role=3;sect=cooking;sub=recipe-ideas;tool=recipe;urlhash=5425cac3a9c2959917d0634f5bd6d842

I need to extract role=X. Also, the value after the equal sign I need to save in another column. The desired output is:

role
3
3

Upvotes: 0

Views: 67

Answers (1)

dsk
dsk

Reputation: 2003

This can be a working solution for you

Create the Dataframe here

df = spark.createDataFrame([(1,"ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking")],[ "col1","col2"])
df.show(truncate=False)
+----+--------------------------------------------------------------------------------------------------------------------------+
|col1|col2                                                                                                                      |
+----+--------------------------------------------------------------------------------------------------------------------------+
|1   |ab=px_d_1200;ab=9;ab=t_d_o_1000;artid=delish.recipe.46338;artid=delish_recipe_46338;avb=85;cat=recipes;role=3;sect=cooking|
+----+--------------------------------------------------------------------------------------------------------------------------+

df_new = df.filter(F.col("col2").contains("role"))
df_new = df_new.withColumn("split_col", F.explode(F.split(F.col("col2"), ";")))
df_new = df_new.filter(F.col("split_col").contains("role"))
df_new = df_new.withColumn("final_col", (F.split(F.col("split_col"), "=")))
df_new = df_new.withColumn("role", F.element_at(F.col('final_col'), -1).alias('role'))
df_new.show()

+----+--------------------+---------+---------+----+
|col1|                col2|split_col|final_col|role|
+----+--------------------+---------+---------+----+
|   1|ab=px_d_1200;ab=9...|   role=3|[role, 3]|   3|
+----+--------------------+---------+---------+----+

Upvotes: 1

Related Questions