Madhav Thaker

Reputation: 370

Python to PySpark Regex: Converting Strings to a List

My code takes a string and extracts the elements within it to create a list.

Here is an example of such a string:

'["A","B"]'

Here is the Python (pandas) code:

import re

df[column + '_upd'] = df[column].apply(lambda x: re.findall(r'"(.*?)"', x.lower()))

This results in the list ['a', 'b'] (the .lower() call lowercases the string before the matches are extracted).
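A quick, self-contained run of that regex (the sample strings here are assumed for illustration):

```python
import re

samples = ['["A","B"]', '["C","D"]']
# .lower() mirrors the original lambda, so matches come back lowercased
extracted = [re.findall(r'"(.*?)"', s.lower()) for s in samples]
print(extracted)  # [['a', 'b'], ['c', 'd']]
```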

I'm brand new to PySpark and am a bit lost on how to do this. I've seen people use regexp_extract, but that doesn't quite apply to this problem.

Any help would be much appreciated.

Upvotes: 1

Views: 334

Answers (1)

murtihash

Reputation: 8410

You can use regexp_replace and split.

from pyspark.sql import functions as F

# strip [, ], quotes and spaces, then split on commas
df.withColumn("new_col", F.split(F.regexp_replace("col", r'[\[\]" ]', ''), ",")).show()

#+---------+-------+
#|      col|new_col|
#+---------+-------+
#|["A","B"]| [A, B]|
#+---------+-------+

#schema
#root
# |-- col: string (nullable = true)
# |-- new_col: array (nullable = true)
# |    |-- element: string (containsNull = true)
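For intuition, here is the same replace-then-split logic in plain Python (a sketch; parse_bracket_list is a hypothetical helper name, not part of any library):

```python
import re

def parse_bracket_list(s):
    # Same idea as the Spark expression above:
    # strip brackets, quotes and spaces, then split on commas.
    cleaned = re.sub(r'[\[\]" ]', '', s)
    return cleaned.split(',')

print(parse_bracket_list('["A","B"]'))  # ['A', 'B']
```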

Upvotes: 1
