Dang Tran

Reputation: 1

Count the occurrences of a specific key in PySpark

Assume I have a column A in which every row is a list of dictionaries like this:

[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

How do I count the number of "a"s?

I would like a solution like F.map().

Many thanks

Upvotes: 0

Views: 909

Answers (2)

Vaebhav

Reputation: 5032

You can use a UDF to achieve this, assuming each row is, as you mentioned, a list of dictionaries:

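from functools import partial

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session, or create one when running outside a shell/notebook
spark = SparkSession.builder.getOrCreate()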

temp_df = spark.createDataFrame(
    [
        [[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]],
        [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
        [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
    ],
    ["A"]
)

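def key_occurence(inp, key=None):
    # Count how many dictionaries in the list contain the given key
    res = 0
    for d in inp:
        if key in d:
            res += 1
    return res

# Bind the key to look for, so the UDF only receives the column value
partial_func = partial(key_occurence, key="a")

key_occurence_udf = F.udf(partial_func, "int")

temp_df = temp_df.withColumn("A_occurence", key_occurence_udf("A"))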

temp_df.show()

+--------------------+-----------+
|                   A|A_occurence|
+--------------------+-----------+
|[[a -> 1, b -> 2,...|          2|
|[[a -> 10, b -> 2...|          2|
|[[a -> 10, b -> 2...|          2|
+--------------------+-----------+

The UDF also takes a key argument (bound here with functools.partial) that specifies which key to look for.
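
A closure can stand in for functools.partial here. The following is a minimal sketch rather than part of the answer above: make_key_counter is a hypothetical helper name, and treating a null row as zero occurrences is an assumption. The returned function is plain Python, so it can be checked locally before being wrapped with F.udf(..., "int") in the same way as partial_func.

```python
def make_key_counter(key):
    """Return a function that counts the dicts in a list that contain `key`."""
    def count_key(rows):
        # Assumption: a null (None) cell counts as zero occurrences
        if rows is None:
            return 0
        return sum(1 for d in rows if key in d)
    return count_key

count_a = make_key_counter("a")
sample = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]
print(count_a(sample))  # 2
print(count_a(None))    # 0
```

The None guard matters mainly if the column can contain nulls; without it, iterating over a null row would raise a TypeError.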

Upvotes: 0

Aelarion

Reputation: 417

Edited Answer:

Adjusted based on a comment from the OP. To count the occurrences of a particular key in a list of dictionaries, you can still use a list comprehension (with a few adjustments):

A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

A_count = len([y for x in A for y in x if y == 'a'])

print(A_count)

Output:

2

We're essentially using the same logic, except that here we use a nested list comprehension. x iterates through A (the dictionaries), and y iterates through x (the keys of each dictionary). Finally, the if condition keeps only the keys equal to the one we are looking for, and len counts them.
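
Since only membership matters, the inner loop over the keys can be replaced with an "in" test. A minimal sketch of that variant on the same sample list (nested_count is an illustrative name, included only to compare the two forms):

```python
A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]

# `"a" in d` checks the dictionary's keys directly, so no inner loop is needed
A_count = sum(1 for d in A if "a" in d)

# The nested comprehension from above, for comparison
nested_count = len([y for x in A for y in x if y == "a"])

print(A_count, nested_count)  # 2 2
```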


Old Answer: I'm not sure this counts as a solution like "map", but you can use a list comprehension, which is fairly straightforward:

A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

A_sum = sum([int(x['a']) for x in A])
print(A_sum)

Output:

3

Explanation:

Essentially, we collect each dictionary's value for the given key 'a', convert that value from a string to an integer with int, and then use sum to add up the resulting values. Some good reference material is on W3Schools.
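
If some dictionaries might lack the key, x['a'] raises a KeyError. A hedged sketch that skips those entries instead; the third dictionary is made-up sample data, added only to show that case:

```python
# The last dictionary deliberately has no "a" key
A = [{"a": "1", "b": "2"}, {"a": "2", "c": "7"}, {"b": "9"}]

# Only convert and add values from dictionaries that actually contain "a"
A_sum = sum(int(x["a"]) for x in A if "a" in x)

print(A_sum)  # 3
```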

Upvotes: 1
