user5870009

join() in pyspark does not produce expected results

num_of_words = (doc_title,num) #number of words in a document
lines = (doc_title,word,num_of_occurrences) #number of occurrences of a specific word in a document

When I called lines.join(num_of_words), I was expecting to get something like:

(doc_title,(word,num_of_occurrences,num))

but I got instead:

(doc_title,(word,num))

and num_of_occurrences was omitted. What did I do wrong here? How am I supposed to join these two RDDs to get the result I'm expecting?
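
A minimal example reproducing this (the data here is made up; sc is the usual SparkContext from the pyspark shell):

num_of_words = sc.parallelize([("doc1", 4242)])
lines = sc.parallelize([("doc1", "wand", 100)])
print(lines.join(num_of_words).collect())
# [('doc1', ('wand', 4242))]  <- num_of_occurrences (100) is dropped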

Upvotes: 0

Views: 115

Answers (1)

christophetd

Reputation: 3874

From the Spark API docs for the join method:

join(other, numPartitions=None)

Return an RDD containing all pairs of elements with matching keys in self and other.

Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

So the join method can only be used on key/value pairs (or at least will only return a result of the form described above).

A way to overcome this is to use tuples of (doc_title, (word, num_of_occurrences)) instead of (doc_title, word, num_of_occurrences). Working example:

# sc is the SparkContext (available as `sc` in the pyspark shell)
num_of_words = sc.parallelize([("harry potter", 4242)])
lines = sc.parallelize([("harry potter", ("wand", 100))])
result = lines.join(num_of_words)
print(result.collect())
# [('harry potter', (('wand', 100), 4242))]

(Note that sc.parallelize only turns a local Python collection into a Spark RDD, and that collect() does the exact opposite.)
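
If you want exactly the flat (doc_title, (word, num_of_occurrences, num)) shape from the question, you can first reshape the 3-tuples into pairs with map, then unpack the nested value after the join. A sketch along those lines (same made-up data as above, not part of the original answer):

# Reshape (doc_title, word, num_of_occurrences) into key/value pairs
lines = sc.parallelize([("harry potter", "wand", 100)])
pairs = lines.map(lambda t: (t[0], (t[1], t[2])))

# Join, then flatten ((word, n), num) into (word, n, num)
flat = pairs.join(num_of_words).mapValues(lambda v: (v[0][0], v[0][1], v[1]))
print(flat.collect())
# [('harry potter', ('wand', 100, 4242))]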

Upvotes: 1
