Reputation: 951
I have a Spark RDD object (using pyspark) and I'm trying to get the equivalent of SQL's
SELECT MY_FIELD, COUNT(*) FROM ... GROUP BY MY_FIELD
So I've tried the following code:
my_groupby_count = myRDD.map(lambda x: x.type).reduceByKey(lambda x, y: x + y).collect()
# 'type' is the name of the field inside the RDD row
But I'm getting an error instead, which I'm not sure how to handle:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-24-23b4c60c6fd6> in <module>()
----> 1 my_groupby_count = myRDD.map(lambda x: x.type).reduceByKey(lambda x, y: x + y).collect()
/root/spark/python/pyspark/rdd.py in collect(self)
with SCCallSiteSync(self.context) as css:
--> port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
return list(_load_from_socket(port, self._jrdd_deserializer))
/root/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
answer = self.gateway_client.send_command(command)
return_value = get_return_value(
-> answer, self.gateway_client, self.target_id, self.name)
Now, since this method has worked well for me before, I suspect it might be related to the data itself. For instance, I know some values of x.type are None, but I'm not sure how to get rid of them.
Any ideas on how to investigate this? P.S. toDF() fails as well, I imagine for the same reason. Also, I'd prefer solutions for RDDs rather than DataFrames. Thanks
Upvotes: 0
Views: 211
Reputation: 1407
You need to feed reduceByKey an RDD of (key, value) tuples; it groups by the first element of each pair. It looks like you just forgot the '()':
myRDD.map(lambda x: (x.type, 1)).reduceByKey(lambda x, y: x + y).collect()
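For illustration, here is a minimal self-contained sketch of the fixed pipeline (assuming a local SparkContext; Row here is a hypothetical stand-in for your actual row type):

from collections import namedtuple
from pyspark import SparkContext

sc = SparkContext("local", "groupby-count-demo")
Row = namedtuple("Row", ["type"])  # stand-in for the real row schema
myRDD = sc.parallelize([Row("a"), Row("b"), Row("a"), Row(None)])

# Map each row to a (key, 1) pair, then sum the 1s per key.
counts = myRDD.map(lambda x: (x.type, 1)).reduceByKey(lambda x, y: x + y).collect()
# e.g. [('a', 2), ('b', 1), (None, 1)] (order is not guaranteed)

This mirrors the SQL: the map builds the GROUP BY key, and the reduceByKey does the COUNT(*).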
Side note: a shorter version of the same logic uses countByKey(), which returns the counts to the driver as a dict:
myRDD.map(lambda x: (x.type,)).countByKey()
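As for the None values you mentioned: None is a valid key for reduceByKey, so it shouldn't cause this error by itself, but if you want to drop those rows before counting, a filter does it (a minimal sketch, same assumptions as above):

my_groupby_count = (myRDD
    .filter(lambda x: x.type is not None)  # drop rows where 'type' is None
    .map(lambda x: (x.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .collect())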
Upvotes: 1