Anubhav Jain

Reputation: 351

Finding a regex expression in PySpark?

I have a column in a PySpark dataframe which contains values separated by `;`:

+----------------------------------------------------------------------------------+
|name                                                                              |
+----------------------------------------------------------------------------------+
|tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no|
+----------------------------------------------------------------------------------+

So, any number of key-value pairs can appear in this column. If I use this:

df.withColumn('test', regexp_extract(col('name'), '(?<=tppid=)(.*?);', 1)).show(1,False)

I can extract the tppid, but when tppid is the last key-value pair in a row it is not extracted. I want a regex that can extract the value of a key wherever it appears in the row.
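For reference, a minimal sketch that reproduces the failing case (the input row and column name are taken from the question; the SparkSession setup is assumed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_extract

    spark = SparkSession.builder.getOrCreate()

    # tppid is the last key-value pair here, so no ';' follows its value
    df = spark.createDataFrame(
        [("xmaslist=no;tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6",)],
        ["name"],
    )

    # The pattern requires a trailing ';' after the value, so 'test' comes back empty
    df.withColumn("test", regexp_extract(col("name"), "(?<=tppid=)(.*?);", 1)).show(1, False)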

Upvotes: 1

Views: 145

Answers (2)

Superluminal

Reputation: 977

In addition to Wiktor Stribiżew's answer, you can use anchors. `$` denotes the end of the string.

tppid=\w+(?=;|\s|$) 

Also, this regex extracts only the value, without the tppid= part:

(?<=tppid=)\w+(?=;|\s|$)
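
A minimal sketch of how this could be plugged into regexp_extract, assuming df and the column name from the question. The pattern has no capturing group, so group index 0 (the whole match) is used:

    from pyspark.sql.functions import col, regexp_extract

    # The lookbehind and lookahead keep 'tppid=' and the delimiter out of the match,
    # so group 0 is just the value itself.
    df.withColumn(
        "test",
        regexp_extract(col("name"), r"(?<=tppid=)\w+(?=;|\s|$)", 0),
    ).show(1, False)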

Upvotes: 0

Wiktor Stribiżew

Reputation: 626748

You may use a negated character class [^;] to match any char but ;:

tppid=([^;]+)

See the regex demo

Since the third argument to regexp_extract is 1 (accessing Group 1 contents), you may discard the lookbehind construct and use tppid= as part of the consuming pattern.
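A sketch of the full call, assuming df and the column name from the question:

    from pyspark.sql.functions import col, regexp_extract

    # Group 1 captures everything after 'tppid=' up to the next ';'
    # or the end of the string, so the position of the pair no longer matters.
    df.withColumn(
        "test",
        regexp_extract(col("name"), r"tppid=([^;]+)", 1),
    ).show(1, False)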

Upvotes: 1
