J Doe

Reputation: 37

PySpark: join two RDDs and flatten the results

Environment is pyspark, Spark Version 2.2.

We have two RDDs, test1 and test2, with the sample data below:

test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]

Now we want to generate output1 as shown below; any help is appreciated.

[('a', 20, 2), ('b', 10, 3)]

Upvotes: 2

Views: 1326

Answers (1)

pault

Reputation: 43494

You can accomplish this with a simple join followed by a call to map to flatten the values.

test1.join(test2).map(lambda kv: (kv[0],) + kv[1]).collect()
#[('a', 20, 2), ('b', 10, 3)]

(Note that tuple-unpacking lambdas like lambda (key, values): ... are Python 2 only; indexing into the (key, values) pair works on both Python 2 and 3.)

To explain, the result of the join is the following:

test1.join(test2).collect()
#[('a', (20, 2)), ('b', (10, 3))]

This is almost the desired output, but you want to flatten the results. We can do that by calling map and returning a new tuple in the desired format. Wrapping the key in a one-element tuple, (key,), lets us concatenate it with the tuple of values.
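If it helps to see the same join-then-flatten logic outside Spark, here is a plain-Python sketch (no Spark APIs, just a dict lookup and list comprehensions) that mimics what join followed by map does on these two datasets:

```python
# Plain-Python sketch of RDD join + flatten (no Spark required).
test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]

# join keeps only keys present in both datasets and pairs their values:
lookup = dict(test2)
joined = [(k, (v, lookup[k])) for k, v in test1 if k in lookup]
# joined == [('a', (20, 2)), ('b', (10, 3))]

# the map step flattens (key, (v1, v2)) into (key, v1, v2):
flattened = [(kv[0],) + kv[1] for kv in joined]
# flattened == [('a', 20, 2), ('b', 10, 3)]
```

The real RDD join behaves the same way for one-to-one keys; this sketch ignores Spark details like partitioning and duplicate keys (where join produces one output row per matching pair).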

You can also use the DataFrame API: call toDF() on each RDD to convert it to a DataFrame, then join on the key column:

test1.toDF(["key", "value1"]).join(test2.toDF(["key", "value2"]), on="key").show()
#+---+------+------+
#|key|value1|value2|
#+---+------+------+
#|  b|    10|     3|
#|  a|    20|     2|
#+---+------+------+

Upvotes: 2
