Reputation: 370
My code takes a string and extracts the elements within it to create a list.
Here is an example of such a string:
'["A","B"]'
Here is the Python code:
df[column + '_upd'] = df[column].apply(lambda x: re.findall(r'"(.*?)"', x.lower()))
This results in a list that includes "A" and "B".
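For reference, the extraction step above can be reproduced on a plain string (a minimal sketch without pandas; the `extract_elements` helper name is mine, and I've dropped the `.lower()` call so the output matches the "A"/"B" result described):

```python
import re

def extract_elements(s):
    # Pull every double-quoted element out of a string like '["A","B"]'
    return re.findall(r'"(.*?)"', s)

print(extract_elements('["A","B"]'))  # → ['A', 'B']
```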
I'm brand new to PySpark and am a bit lost on how to do this. I've seen people use regexp_extract,
but that doesn't quite apply to this problem.
Any help would be much appreciated.
Upvotes: 1
Views: 334
Reputation: 8410
You can use regexp_replace and split.
from pyspark.sql import functions as F
df.withColumn("new_col", F.split(F.regexp_replace("col", r'[\[\]" ]', ''), ",")).show()
#+---------+-------+
#| col|new_col|
#+---------+-------+
#|["A","B"]| [A, B]|
#+---------+-------+
#schema
#root
#|-- col: string (nullable = true)
#|-- new_col: array (nullable = true)
#| |-- element: string (containsNull = true)
Upvotes: 1