Reputation: 951
I'm starting with Spark, and I don't understand some of the concepts yet.
I have a file with pairs of names like this:
foo bar
bar foo
Both lines express the same relation between foo and bar. I'm trying to create an RDD with just one relation:
foo bar
I wrote this code:
step1 = (joined
         .reduceByKey(lambda x, y: x + ';' + y)
         .map(lambda x: (x[0], x[1].split(';')))
         .sortByKey(True)
         .mapValues(lambda x: sorted(x))
         .collect())
to create the first output, and I think I need another reduceByKey to remove the values that already exist from the previous step, but I don't know how to do that.
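For illustration (assuming joined holds the (name, name) pairs parsed from the file, which I haven't shown), the chain above gives:
joined = sc.parallelize([('foo', 'bar'), ('bar', 'foo')])  # sample pairs from the file
# step1 then comes out as:
# [('bar', ['foo']), ('foo', ['bar'])]
so both directions of the relation are still there.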
Am I thinking correctly?
Upvotes: 0
Views: 380
Reputation: 4719
from pyspark.sql import functions as f
rdd = spark.sparkContext.parallelize([('foo', 'bar'), ('bar', 'foo'), ])
df = spark.createDataFrame(rdd, schema=['c1', 'c2'])
df = df.withColumn('c3', f.sort_array(f.array(df['c1'], df['c2'])))
df.show()
# output:
+---+---+----------+
| c1| c2| c3|
+---+---+----------+
|foo|bar|[bar, foo]|
|bar|foo|[bar, foo]|
+---+---+----------+
Using the DataFrame API is much easier.
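To then keep a single row per relation, one possible next step (a sketch; note that which of the two input rows survives is not guaranteed) is to deduplicate on the canonical column c3:
df.dropDuplicates(['c3']).show()
# output (one row; either ordering of c1/c2 may survive):
# +---+---+----------+
# | c1| c2|        c3|
# +---+---+----------+
# |foo|bar|[bar, foo]|
# +---+---+----------+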
Upvotes: 1
Reputation: 4625
How about something simple like:
>>> sc.parallelize(("foo bar", "bar foo")).map(lambda x: " ".join(sorted(x.split(" ")))).distinct().collect()
['bar foo']
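The same normalize-then-distinct idea applies if the RDD already holds (name, name) tuples instead of space-separated strings, for example:
>>> sc.parallelize([("foo", "bar"), ("bar", "foo")]).map(lambda kv: tuple(sorted(kv))).distinct().collect()
[('bar', 'foo')]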
Upvotes: 1