Reputation: 351
I have a column in a PySpark DataFrame that contains key-value pairs separated by ;
+----------------------------------------------------------------------------------+
|name |
+----------------------------------------------------------------------------------+
|tppid=dfc36cc18bba07ae2419a1501534aec6fdcc22e0dcefed4f58c48b0169f203f6;xmaslist=no|
+----------------------------------------------------------------------------------+
Any number of key-value pairs can appear in this column. If I use
df.withColumn('test', regexp_extract(col('name'), '(?<=tppid=)(.*?);', 1)).show(1,False)
I can extract the tppid, but when tppid is the last key-value pair in a row there is no trailing ; to match, so nothing is extracted. I want a regex that extracts the value of a key wherever it appears in the row.
Upvotes: 1
Views: 145
Reputation: 977
In addition to Wiktor Stribiżew's answer, you can use anchors. $ denotes the end of the string.
tppid=\w+(?=;|\s|$)
This regex extracts only the value, without the tppid= part:
(?<=tppid=)\w+(?=;|\s|$)
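A minimal sketch of how the lookaround pattern behaves, using Python's re module rather than Spark (Spark uses Java regular expressions, but this particular pattern should behave the same in both). The helper name extract_tppid and the sample strings are made up for illustration. Note that the pattern has no capturing group, so with regexp_extract the group index would presumably need to be 0 (the whole match) rather than 1.

```python
import re

# Lookbehind/lookahead pattern from the answer above.
# It has no capturing group, so the value is the whole match (group 0).
PATTERN = re.compile(r"(?<=tppid=)\w+(?=;|\s|$)")

def extract_tppid(text):
    """Return the tppid value, or None if the key is absent (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

# Illustrative inputs: key first, key last, key missing.
first = extract_tppid("tppid=abc123;xmaslist=no")
last = extract_tppid("xmaslist=no;tppid=abc123")
missing = extract_tppid("xmaslist=no")
```

Because \w+ only matches letters, digits and underscores, this sketch assumes the value contains no other characters.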
Upvotes: 0
Reputation: 626748
You may use a negated character class, [^;], to match any character other than ;:
tppid=([^;]+)
See the regex demo
Since the third argument to regexp_extract is 1 (accessing the Group 1 contents), you may discard the lookbehind construct and use tppid= as part of the consuming pattern.
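As a rough check of the capturing-group approach, here is a sketch in plain Python (re module) that mimics what regexp_extract(col('name'), 'tppid=([^;]+)', 1) would return for a single string, including the empty string when nothing matches. The helper extract_value, the sample strings, and the ANCHORED variant are assumptions added for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: capture everything after "tppid=" up to the next ";".
PATTERN = re.compile(r"tppid=([^;]+)")

# Hypothetical stricter variant: require the key to start the string or follow ";",
# so that a longer key ending in "tppid" is not matched by accident.
ANCHORED = re.compile(r"(?:^|;)tppid=([^;]+)")

def extract_value(text, pattern=PATTERN):
    """Return Group 1, or "" when there is no match (as regexp_extract does)."""
    match = pattern.search(text)
    return match.group(1) if match else ""

# Equivalent PySpark call, assuming a DataFrame df with a string column "name":
# df.withColumn('test', regexp_extract(col('name'), 'tppid=([^;]+)', 1)).show(1, False)

key_first = extract_value("tppid=abc123;xmaslist=no")
key_last = extract_value("xmaslist=no;tppid=abc123")
no_key = extract_value("xmaslist=no")
```

The anchored variant only matters if another key could end in tppid; for the data shown in the question, the simpler pattern should be enough.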
Upvotes: 1