Reputation: 81
I am learning how to program with Spark in Python and am stuck on one problem. I have a PythonRDD loaded as (id, description) pairs:
pythonRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]
And a ParallelCollectionRDD loaded as (id, description) pairs as well:
paraRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]
I can do a count on the paraRDD like this:
paraRDD.map(lambda l: (l[0], len(l[1]))).reduce(lambda a, b: a[1] + b[1])
or simply
paraRDD.reduce(lambda a, b: len(a[1]) + len(b[1]))
but on the pythonRDD it runs into a bug, which says:
"TypeError: 'int' object has no attribute '__getitem__'".
def countTokens(vendorRDD):
    return vendorRDD.map(lambda l: (l[0], len(l[1]))).reduce(lambda a, b: a[1] + b[1])
Any idea of why this happens would be appreciated!
Upvotes: 3
Views: 3082
Reputation: 330283
The difference between PythonRDD and ParallelCollectionRDD is completely irrelevant here. Your code is just wrong.
The reduce method takes an associative and commutative function with the following signature:
(T, T) => T
In other words, both arguments and the returned object have to be of the same type, and the order of operations and parenthesization cannot affect the final result. The function you pass to reduce simply doesn't satisfy these criteria.
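You can see the failure mechanism without Spark at all. Here is a minimal sketch using Python's built-in functools.reduce with made-up (id, tokens) pairs; Spark's reduce combines results pairwise the same way:
from functools import reduce

pairs = [('id1', ['a', 'b']), ('id2', ['c']), ('id3', ['d', 'e', 'f'])]

# Step 1: a = ('id1', [...]), b = ('id2', [...]) -> returns 2 + 1 = 3, a plain int
# Step 2: a = 3, b = ('id3', [...]) -> a[1] raises a TypeError, because an int
#         cannot be indexed (in Python 2 the message is exactly
#         "'int' object has no attribute '__getitem__'")
reduce(lambda a, b: len(a[1]) + len(b[1]), pairs)
The same thing happens with your map-then-reduce variant: after the first merge the accumulator is an int, not an (id, count) tuple, so a[1] fails.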
To make it work you'll need something like this:
rdd.map(lambda l: len(l[1])).reduce(lambda x, y: x + y)
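Here the map step turns every record into a plain int before the reduce, so both arguments and the return value of the lambda are always the same type.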
or even better:
from operator import add
rdd.values().map(len).reduce(add)
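As a quick self-contained check (this assumes an already-created SparkContext named sc; the second record is invented for the example):
from operator import add

rdd = sc.parallelize([
    ('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund']),
    ('b000fake01', ['some', 'made', 'up', 'tokens'])
])

rdd.map(lambda l: len(l[1])).reduce(lambda x, y: x + y)
## 13
rdd.values().map(len).reduce(add)
## 13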
Upvotes: 1