Reputation: 1023
I am fetching a column from a DataFrame. The column is of string type:
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]"
and so on.
The data is stored as a string, but it can easily be represented as a list. I want the output to be:
LIST of [
{somevalues, id:1, name:'xyz'},
{address:Some Value},
{somevalue}
]
How can I achieve this using Spark's API? I know that in plain Python I can use the eval(x) function, which will return the list, or the x.split() function, which will also return a list. However, with that approach it has to iterate over each record.
Also, I want to use mapPartition; that is the reason I need my string column as a list, so that I can pass it to mapPartition.
Is there an efficient way to convert my string data using the Spark API, or would mapPartitions be even better, since I would be looping over every partition rather than every record?
Upvotes: 1
Views: 4028
Reputation: 41957
If you don't want to go the DataFrame route, then you can use regex replace and split functions on the RDD you created.
If you have data as
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]"
Then you can create an RDD and apply regex replace and split as
import re
# strip brackets/whitespace, mark the gaps between the {...} groups, then split on that marker
rdd = sc.parallelize([x]).flatMap(lambda s: re.sub(r"},\{", "};&;{", re.sub(r"[\[\]\s]", "", s)).split(";&;"))
flatMap is used so that the split data comes out as separate rows:
{somevalues,id:1,name:'xyz'}
{address:SomeValue}
{somevalue}
I hope the answer is helpful.
Note: if you want the solution the DataFrame way, you can get ideas from my other answer; a rough sketch is below.
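For illustration only, here is a minimal DataFrame-style sketch of the same idea, assuming spark is an active SparkSession and the raw string sits in a column named value (these names are just placeholders):
from pyspark.sql import functions as F

df = spark.createDataFrame([(x,)], ["value"])
df_with_list = df.withColumn(
    "value_list",
    F.split(
        F.regexp_replace(                                # mark the gaps between the {...} groups
            F.regexp_replace("value", r"[\[\]\s]", ""),  # strip brackets and whitespace first
            r"\},\{", "};&;{"),
        ";&;"))                                          # then split on that marker
df_with_list.show(truncate=False)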
Upvotes: 0
Reputation: 418
You can use regexp_replace to remove the square brackets and then split on the comma. At first I thought you'd need to do something special to avoid splitting on the commas inside the curly brackets, but it seems Spark SQL avoids that automatically. For example, the following query in Zeppelin
%sql
select split(regexp_replace("[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]", "[\\[\\] ]", ""), ",")
gives me
WrappedArray({somevalues, id:1, name:'xyz'}, {address:SomeValue}, {somevalue})
which is what you want.
You can use withColumn to add a column in this way if you're working with DataFrames (a rough sketch follows). And if, for some reason, the commas inside the curly brackets do get split on, you can do more regex-foo as in this post - Regex: match only outside parenthesis (so that the text isn't split within parenthesis)?.
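As an example, the withColumn form of that same expression might look roughly like this (the column and variable names are just placeholders):
from pyspark.sql import functions as F

# x is the raw string from the question
df = spark.createDataFrame([(x,)], ["raw"])
# same expression as the SQL above: drop brackets and spaces, then split on the comma
df = df.withColumn("raw_list", F.split(F.regexp_replace("raw", r"[\[\] ]", ""), ","))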
Hope that makes sense. I'm not sure whether you're using DataFrames, but they're recommended over the lower-level RDD API.
Upvotes: 2