Reputation: 951
I'm starting with Spark, and I don't understand some of the concepts yet.
I have a file with pairs of names like this:
foo bar
bar foo
Both lines express the same relation between foo and bar. I'm trying to create an RDD with just one relation:
foo bar
I wrote this code:
step1 = (joined
         .reduceByKey(lambda x, y: x + ';' + y)
         .map(lambda x: (x[0], x[1].split(';')))
         .sortByKey(True)
         .mapValues(lambda x: sorted(x))
         .collect())
to create the first output, and I think I need another reduceByKey to remove the values that already exist from the previous step, but I don't know how to do that.
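For illustration (assuming joined holds the (name, name) pairs parsed from the file, which I haven't shown), the chain above gives:
joined = sc.parallelize([('foo', 'bar'), ('bar', 'foo')])  # sample pairs from the file
# step1 then comes out as:
# [('bar', ['foo']), ('foo', ['bar'])]
so both directions of the relation are still there.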
Am I thinking correctly?
Upvotes: 0
Views: 380
Reputation: 4719
from pyspark.sql import functions as f
rdd = spark.sparkContext.parallelize([('foo', 'bar'), ('bar', 'foo'), ])
df = spark.createDataFrame(rdd, schema=['c1', 'c2'])
df = df.withColumn('c3', f.sort_array(f.array(df['c1'], df['c2'])))
df.show()
# output:
+---+---+----------+
| c1| c2| c3|
+---+---+----------+
|foo|bar|[bar, foo]|
|bar|foo|[bar, foo]|
+---+---+----------+
Using the DataFrame API is much easier.
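To then keep a single row per relation, one possible next step (a sketch; note that which of the two input rows survives is not guaranteed) is to deduplicate on the canonical column c3:
df.dropDuplicates(['c3']).show()
# output (one row; either ordering of c1/c2 may survive):
# +---+---+----------+
# | c1| c2|        c3|
# +---+---+----------+
# |foo|bar|[bar, foo]|
# +---+---+----------+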
Upvotes: 1
Reputation: 4625
How about something simple like:
>>> sc.parallelize(("foo bar", "bar foo")).map(lambda x: " ".join(sorted(x.split(" ")))).distinct().collect()
['bar foo']
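The same normalize-then-distinct idea applies if the RDD already holds (name, name) tuples instead of space-separated strings, for example:
>>> sc.parallelize([("foo", "bar"), ("bar", "foo")]).map(lambda kv: tuple(sorted(kv))).distinct().collect()
[('bar', 'foo')]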
Upvotes: 1