Reputation: 1023
I am fetching a column from a DataFrame. The column is of string type:
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]"
and so on.
The data is stored as a string, but it can easily be represented as a list. I want the output to be:
LIST of [
{somevalues, id:1, name:'xyz'},
{address:Some Value},
{somevalue}
]
How can I achieve this using Spark's API? I know that in plain Python I can use the eval(x) function, which will return the list, or the x.split() function, which will also return a list. However, with that approach it has to iterate over each record.
Also, I want to use mapPartition; that is the reason I need my string column as a list, so that I can pass it to mapPartition.
Is there an efficient way to convert my string data using the Spark API, or would mapPartitions be even better, since I would be looping over every partition rather than every record?
Upvotes: 1
Views: 4028
Reputation: 41957
If you don't want to go the DataFrame route, then you can use regex replace and split functions on the RDD you created.
If you have data as
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]"
Then you can create an RDD and apply regex replace and split as
import re
# strip brackets/whitespace, mark the gaps between the {...} groups, then split on that marker
rdd = sc.parallelize([x]).flatMap(lambda s: re.sub(r"},\{", "};&;{", re.sub(r"[\[\]\s]", "", s)).split(";&;"))
flatMap is used so that the split data comes out as separate rows:
{somevalues,id:1,name:'xyz'}
{address:SomeValue}
{somevalue}
I hope the answer is helpful.
Note: if you want the solution the DataFrame way, you can get ideas from my other answer; a rough sketch is below.
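For illustration only, here is a minimal DataFrame-style sketch of the same idea, assuming spark is an active SparkSession and the raw string sits in a column named value (these names are just placeholders):
from pyspark.sql import functions as F

df = spark.createDataFrame([(x,)], ["value"])
df_with_list = df.withColumn(
    "value_list",
    F.split(
        F.regexp_replace(                                # mark the gaps between the {...} groups
            F.regexp_replace("value", r"[\[\]\s]", ""),  # strip brackets and whitespace first
            r"\},\{", "};&;{"),
        ";&;"))                                          # then split on that marker
df_with_list.show(truncate=False)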
Upvotes: 0
Reputation: 418
You can use regexp_replace to remove the square brackets and then split on the comma. At first I thought you'd need to do something special to avoid splitting on the commas inside the curly brackets, but it seems Spark SQL avoids that automatically. For example, the following query in Zeppelin
%sql
select split(regexp_replace("[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]", "[\\[\\] ]", ""), ",")
gives me
WrappedArray({somevalues, id:1, name:'xyz'}, {address:SomeValue}, {somevalue})
which is what you want.
You can use withColumn to add a column in this way if you're working with DataFrames (a rough sketch follows). And if, for some reason, the commas inside the curly brackets do get split on, you can do more regex-foo as in this post - Regex: match only outside parenthesis (so that the text isn't split within parenthesis)?.
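As an example, the withColumn form of that same expression might look roughly like this (the column and variable names are just placeholders):
from pyspark.sql import functions as F

# x is the raw string from the question
df = spark.createDataFrame([(x,)], ["raw"])
# same expression as the SQL above: drop brackets and spaces, then split on the comma
df = df.withColumn("raw_list", F.split(F.regexp_replace("raw", r"[\[\] ]", ""), ","))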
Hope that makes sense. I'm not sure whether you're using DataFrames, but they're recommended over the lower-level RDD API.
Upvotes: 2