Reputation:
num_of_words = (doc_title,num) #number of words in a document
lines = (doc_title,word,num_of_occurrences) #number of occurrences of a specific word in a document
When I called lines.join(num_of_words), I was expecting to get something like:
(doc_title,(word,num_of_occurrences,num))
but I got instead:
(doc_title,(word,num))
and num_of_occurrences was omitted. What did I do wrong here? How am I supposed to join these two RDDs to get the result I'm expecting?
Upvotes: 0
Views: 115
Reputation: 3874
In the API docs of Spark for the join
method:
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
So the join
method can only be used on pairs (or at least will only return you a result of the described form).
A way to overcome this would be to have tuples of (doc_title, (word, num_occurrences)) instead of (doc_title, word, num_occurrences). Working example:
num_of_words = sc.parallelize([("harry potter", 4242)])
lines = sc.parallelize([("harry potter", ("wand", 100))])
result = lines.join(num_of_words)
print result.collect()
# [('harry potter', (('wand', 100), 4242))]
(Note that sc.parallelize only turns a local python collection into a Spark RDD, and that collect() does the exact opposite)
Upvotes: 1