Anubhav Jain

Reputation: 351

Finding a regex expression in PySpark?

I have a column in a PySpark dataframe which contains values separated by `;`:

+----------------------------------------------------------------------------------+
|name                                                                              |
+----------------------------------------------------------------------------------+
|tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no|
+----------------------------------------------------------------------------------+

So, any number of key-value pairs can appear in this column. If I use this:

df.withColumn('test', regexp_extract(col('name'), '(?<=tppid=)(.*?);', 1)).show(1,False)

I can extract the tppid, but when tppid is the last key-value pair in a row it is not extracted. I want a regex that can extract the value of a key wherever it appears in the row.
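For reference, a minimal sketch that reproduces the failing case (the input row and column name are taken from the question; the SparkSession setup is assumed):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_extract

    spark = SparkSession.builder.getOrCreate()

    # tppid is the last key-value pair here, so no ';' follows its value
    df = spark.createDataFrame(
        [("xmaslist=no;tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6",)],
        ["name"],
    )

    # The pattern requires a trailing ';' after the value, so 'test' comes back empty
    df.withColumn("test", regexp_extract(col("name"), "(?<=tppid=)(.*?);", 1)).show(1, False)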

Upvotes: 1

Views: 145

Answers (2)

Superluminal

Reputation: 977

In addition to Wiktor Stribiżew's answer, you can use anchors. `$` denotes the end of the string.

tppid=\w+(?=;|\s|$) 

Also, this regex extracts only the value, without the tppid= part:

(?<=tppid=)\w+(?=;|\s|$)
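
A minimal sketch of how this could be plugged into regexp_extract, assuming df and the column name from the question. The pattern has no capturing group, so group index 0 (the whole match) is used:

    from pyspark.sql.functions import col, regexp_extract

    # The lookbehind and lookahead keep 'tppid=' and the delimiter out of the match,
    # so group 0 is just the value itself.
    df.withColumn(
        "test",
        regexp_extract(col("name"), r"(?<=tppid=)\w+(?=;|\s|$)", 0),
    ).show(1, False)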

Upvotes: 0

Wiktor Stribiżew

Reputation: 626748

You may use a negated character class [^;] to match any char but ;:

tppid=([^;]+)

See the regex demo

Since the third argument to regexp_extract is 1 (accessing Group 1 contents), you may discard the lookbehind construct and use tppid= as part of the consuming pattern.
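A sketch of the full call, assuming df and the column name from the question:

    from pyspark.sql.functions import col, regexp_extract

    # Group 1 captures everything after 'tppid=' up to the next ';'
    # or the end of the string, so the position of the pair no longer matters.
    df.withColumn(
        "test",
        regexp_extract(col("name"), r"tppid=([^;]+)", 1),
    ).show(1, False)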

Upvotes: 1
