Jude

Reputation: 81

What's the difference between a PythonRDD and a ParallelCollectionRDD?

I am learning how to program with Spark in Python and struggle with one problem.

The problem is that I have a PythonRDD loaded as (id, description) pairs:

pythonRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]

And a ParallelCollectionRDD loaded the same way:

paraRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]

I can do a count on the paraRDD like this:

paraRDD.map(lambda l: (l[0],len(l[1]))).reduce(lambda a,b: a[1] + b[1])

or simply

paraRDD.reduce(lambda a,b: len(a[1]) + len(b[1]))

but the equivalent code fails on the pythonRDD:

def countTokens(vendorRDD):
    return vendorRDD.map(lambda l: (l[0], len(l[1]))).reduce(lambda a, b: a[1] + b[1])

The error says:

TypeError: 'int' object has no attribute '__getitem__'

Any idea how this happened would be appreciated!

Upvotes: 3

Views: 3082

Answers (1)

zero323

Reputation: 330283

The difference between PythonRDD and ParallelCollectionRDD is completely irrelevant here; your code is simply wrong.

The reduce method takes an associative and commutative function with the following signature:

(T, T) => T

In other words, both arguments and the returned object have to be of the same type, and the order of operations and the parenthesization cannot affect the final result. The function you pass to reduce simply doesn't satisfy these criteria: after the first merge the accumulated value is a plain int, so the next call's a[1] blows up.
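You can reproduce the failure with plain Python's functools.reduce, which follows the same contract. A minimal sketch, using the token count from the question's first record plus two made-up records:

from functools import reduce

# What map(lambda l: (l[0], len(l[1]))) produces: (id, count) pairs.
# Only the first id is from the question; the other two are invented.
pairs = [('b000jz4hqo', 9), ('b0006zf55o', 5), ('b00004tkvy', 3)]

# First merge: a and b are both tuples, so the lambda returns 9 + 5 == 14.
# Second merge: a is now the int 14, and a[1] raises
# TypeError: 'int' object has no attribute '__getitem__' on Python 2
# (Python 3 reports "'int' object is not subscriptable" instead).
reduce(lambda a, b: a[1] + b[1], pairs)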

To make it work you'll need something like this:

rdd.map(lambda l: len(l[1])).reduce(lambda x, y: x + y)

or even better:

from operator import add

rdd.values().map(len).reduce(add)
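For completeness, a quick check of both versions against the record from the question plus two made-up ones (a sketch assuming a local SparkContext is available as sc):

from operator import add

# The first record is from the question; the other two are invented.
rdd = sc.parallelize([
    ('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image',
                    'pack', 'dvd', 'rom', 'broderbund']),
    ('b0006zf55o', ['ca', 'international', 'arcserve', 'lap', 'desktop']),
    ('b00004tkvy', ['noah', 'ark', 'activity']),
])

rdd.map(lambda l: len(l[1])).reduce(lambda x, y: x + y)
## 17

rdd.values().map(len).reduce(add)
## 17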

Upvotes: 1
