Reputation: 951
I have a Spark RDD object (using pyspark) and I'm trying to get the equivalent of SQL's
SELECT MY_FIELD, COUNT(*) FROM ... GROUP BY MY_FIELD
So I've tried the following code:
my_groupby_count = myRDD.map(lambda x: x.type).reduceByKey(lambda x, y: x + y).collect()
# 'type' is the name of the field inside the RDD row
But I'm getting an error instead, which I'm not sure how to handle:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-24-23b4c60c6fd6> in <module>()
----> 1 my_groupby_count = myRDD.map(lambda x: x.type).reduceByKey(lambda x, y: x + y).collect()
/root/spark/python/pyspark/rdd.py in collect(self)
with SCCallSiteSync(self.context) as css:
--> port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
return list(_load_from_socket(port, self._jrdd_deserializer))
/root/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
answer = self.gateway_client.send_command(command)
return_value = get_return_value(
-> answer, self.gateway_client, self.target_id, self.name)
Now, since this method has worked well for me before, I suspect it might be related to the data itself. For instance, I know some values of x.type are None, but I'm not sure how to get rid of them.
Any ideas on how to investigate this? P.S. toDF() fails as well, I imagine for the same reason. Also, I'd prefer solutions for RDDs rather than DataFrames. Thanks
Upvotes: 0
Views: 211
Reputation: 1407
You need to feed reduceByKey an RDD of (key, value) tuples; it groups by the first element of each pair. It looks like you just forgot the '()':
myRDD.map(lambda x: (x.type, 1)).reduceByKey(lambda x, y: x + y).collect()
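For illustration, here is a minimal self-contained sketch of the fixed pipeline (assuming a local SparkContext; Row here is a hypothetical stand-in for your actual row type):

from collections import namedtuple
from pyspark import SparkContext

sc = SparkContext("local", "groupby-count-demo")
Row = namedtuple("Row", ["type"])  # stand-in for the real row schema
myRDD = sc.parallelize([Row("a"), Row("b"), Row("a"), Row(None)])

# Map each row to a (key, 1) pair, then sum the 1s per key.
counts = myRDD.map(lambda x: (x.type, 1)).reduceByKey(lambda x, y: x + y).collect()
# e.g. [('a', 2), ('b', 1), (None, 1)] (order is not guaranteed)

This mirrors the SQL: the map builds the GROUP BY key, and the reduceByKey does the COUNT(*).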
Side note: a shorter version of the same logic uses countByKey(), which returns the counts to the driver as a dict:
myRDD.map(lambda x: (x.type,)).countByKey()
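As for the None values you mentioned: None is a valid key for reduceByKey, so it shouldn't cause this error by itself, but if you want to drop those rows before counting, a filter does it (a minimal sketch, same assumptions as above):

my_groupby_count = (myRDD
    .filter(lambda x: x.type is not None)  # drop rows where 'type' is None
    .map(lambda x: (x.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .collect())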
Upvotes: 1