Jude

Reputation: 81

What's the difference between a PythonRDD and a ParallelCollectionRDD?

I am learning how to program with Spark in Python and struggle with one problem.

The problem is that I have a PythonRDD loaded as (id, description) pairs:

pythonRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]

And a ParallelCollectionRDD loaded the same way:

paraRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]

I can do a count on the paraRDD like this:

paraRDD.map(lambda l: (l[0],len(l[1]))).reduce(lambda a,b: a[1] + b[1])

or simply

paraRDD.reduce(lambda a,b: len(a[1]) + len(b[1]))

but the equivalent code fails on the pythonRDD:

def countTokens(vendorRDD):
    return vendorRDD.map(lambda l: (l[0], len(l[1]))).reduce(lambda a, b: a[1] + b[1])

The error says:

TypeError: 'int' object has no attribute '__getitem__'

Any idea how this happened would be appreciated!

Upvotes: 3

Views: 3082

Answers (1)

zero323

Reputation: 330283

The difference between PythonRDD and ParallelCollectionRDD is completely irrelevant here; your code is simply wrong.

The reduce method takes an associative and commutative function with the following signature:

(T, T) => T

In other words, both arguments and the returned object have to be of the same type, and the order of operations and the parenthesization cannot affect the final result. The function you pass to reduce simply doesn't satisfy these criteria: after the first merge the accumulated value is a plain int, so the next call's a[1] blows up.
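You can reproduce the failure with plain Python's functools.reduce, which follows the same contract. A minimal sketch, using the token count from the question's first record plus two made-up records:

from functools import reduce

# What map(lambda l: (l[0], len(l[1]))) produces: (id, count) pairs.
# Only the first id is from the question; the other two are invented.
pairs = [('b000jz4hqo', 9), ('b0006zf55o', 5), ('b00004tkvy', 3)]

# First merge: a and b are both tuples, so the lambda returns 9 + 5 == 14.
# Second merge: a is now the int 14, and a[1] raises
# TypeError: 'int' object has no attribute '__getitem__' on Python 2
# (Python 3 reports "'int' object is not subscriptable" instead).
reduce(lambda a, b: a[1] + b[1], pairs)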

To make it work you'll need something like this:

rdd.map(lambda l: len(l[1])).reduce(lambda x, y: x + y)

or even better:

from operator import add

rdd.values().map(len).reduce(add)
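For completeness, a quick check of both versions against the record from the question plus two made-up ones (a sketch assuming a local SparkContext is available as sc):

from operator import add

# The first record is from the question; the other two are invented.
rdd = sc.parallelize([
    ('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image',
                    'pack', 'dvd', 'rom', 'broderbund']),
    ('b0006zf55o', ['ca', 'international', 'arcserve', 'lap', 'desktop']),
    ('b00004tkvy', ['noah', 'ark', 'activity']),
])

rdd.map(lambda l: len(l[1])).reduce(lambda x, y: x + y)
## 17

rdd.values().map(len).reduce(add)
## 17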

Upvotes: 1
