Carleto

Reputation: 951

Spark - compare key with values

I'm starting with Spark, and I don't understand some concepts yet.

I have a file with pairs of names like this:

foo bar
bar foo

But both lines describe the same relation between foo and bar. I'm trying to create an RDD with just one relation:

foo bar

I wrote this code:

step1 = (joined
         .reduceByKey(lambda x, y: x + ';' + y)
         .map(lambda x: (x[0], x[1].split(';')))
         .sortByKey(True)
         .mapValues(lambda x: sorted(x))
         .collect())

to create the first output, and I think I need another reduceByKey to remove values already covered by the previous step, but I don't know how to do that.

Am I thinking correctly?
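
For reference, a minimal sketch of what this chain produces on the sample input (the setup lines here are illustrative, assuming joined holds the raw pairs):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
joined = sc.parallelize([('foo', 'bar'), ('bar', 'foo')])

step1 = (joined
         .reduceByKey(lambda x, y: x + ';' + y)
         .map(lambda x: (x[0], x[1].split(';')))
         .sortByKey(True)
         .mapValues(lambda x: sorted(x))
         .collect())

# step1 == [('bar', ['foo']), ('foo', ['bar'])]
# Both directions of the relation are still present, just grouped
# per key, which is why a second dedup pass seems necessary.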

Upvotes: 0

Views: 380

Answers (2)

Zhang Tong

Reputation: 4719

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([('foo', 'bar'), ('bar', 'foo')])
df = spark.createDataFrame(rdd, schema=['c1', 'c2'])
# Normalize each pair: sort the two names into a canonical order.
df = df.withColumn('c3', f.sort_array(f.array(df['c1'], df['c2'])))
df.show()

# output:
+---+---+----------+
| c1| c2|        c3|
+---+---+----------+
|foo|bar|[bar, foo]|
|bar|foo|[bar, foo]|
+---+---+----------+

Using a DataFrame is much easier.
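
A hedged follow-up, not part of the original answer: to actually keep a single row per relation, you could deduplicate on the normalized column, e.g.:

# Keep one row per canonical pair; which of the two original
# rows survives is arbitrary.
df.dropDuplicates(['c3']).select('c1', 'c2').show()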

Upvotes: 1

santon

Reputation: 4625

How about something simple like:

>>> sc.parallelize(("foo bar", "bar foo")).map(lambda x: " ".join(sorted(x.split(" ")))).distinct().collect()
['bar foo']
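
A hedged extension of the same idea, if the result is wanted back as tuples rather than strings:

>>> sc.parallelize([("foo", "bar"), ("bar", "foo")]).map(lambda p: tuple(sorted(p))).distinct().collect()
[('bar', 'foo')]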

Upvotes: 1
