Reputation: 37
Environment is pyspark, Spark Version 2.2.
We have two RDDs, test1 and test2; sample data is below:
test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]
Now we want to generate output1 as shown below. Any help is appreciated.
[('a', 20, 2), ('b', 10, 3)]
Upvotes: 2
Views: 1326
Reputation: 43494
You can accomplish this with a simple join followed by a call to map to flatten the values.
test1.join(test2).map(lambda kv: (kv[0],) + kv[1]).collect()
#[('a', 20, 2), ('b', 10, 3)]
To explain, the result of the join is the following:
test1.join(test2).collect()
#[('a', (20, 2)), ('b', (10, 3))]
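To see the shape that join produces without a running Spark cluster, here is a plain-Python sketch of an inner join on key-value pairs (`inner_join` is a hypothetical helper for illustration, not a Spark API):

```python
# Plain-Python analogue of RDD.join, to show the nested-value shape.
def inner_join(left, right):
    right_map = {}
    for k, v in right:
        right_map.setdefault(k, []).append(v)
    # For each left pair whose key also appears on the right,
    # emit (key, (left_value, right_value)) -- the values end up nested.
    return [(k, (lv, rv)) for k, lv in left for rv in right_map.get(k, [])]

test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]
print(inner_join(test1, test2))  # [('a', (20, 2)), ('b', (10, 3))]
```

Note that `'c'` is dropped, because join in Spark is an inner join: only keys present in both RDDs survive.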
This is almost the desired output, but you want to flatten the results. We can accomplish this by calling map and returning a new tuple in the desired format. The syntax (key,) creates a one-element tuple containing just the key, which we then concatenate with the values.
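The tuple arithmetic here is plain Python and can be checked locally on a single joined record:

```python
# One record in the shape produced by join: (key, (value1, value2))
record = ('a', (20, 2))
key, values = record
flat = (key,) + values  # (key,) is a one-element tuple; + concatenates tuples
print(flat)  # ('a', 20, 2)
```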
You can also use the DataFrame API, calling toDF() on your RDDs to convert them to DataFrames (this requires an active SparkSession):
test1.toDF(["key", "value1"]).join(test2.toDF(["key", "value2"]), on="key").show()
#+---+------+------+
#|key|value1|value2|
#+---+------+------+
#| b| 10| 3|
#| a| 20| 2|
#+---+------+------+
Upvotes: 2