Count the occurrences of a specific key in PySpark

Question

Assume that I have a column A in which every row is a list like this:

[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

How do I count the number of "a"s?

I would like a solution like F.map().

Many thanks

Aelarion · Accepted Answer

Edited Answer:

Adjusted based on a comment from the OP. To count the occurrences of a particular key in a list of dictionaries, you can still use a list comprehension (with a few adjustments):

A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

A_count = len([y for x in A for y in x if y == 'a'])

print(A_count)

Output:

2

This uses the same logic as the old answer below, but with a nested list comprehension. x iterates through A (the dictionaries), and y iterates through x (the keys of each dictionary). The if condition keeps only the keys equal to 'a', so the length of the resulting list is the number of times that key occurs.
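The same count can also be written as one membership test per dictionary instead of a loop over every key. This is only a minimal sketch: the helper name count_key is hypothetical, and it assumes each row of column A reaches Python as a list of dicts (or None for an empty cell).

```python
def count_key(rows, key):
    """Return how many dictionaries in `rows` contain `key`."""
    # `key in d` evaluates to True or False; sum() counts each True as 1.
    # `rows or []` treats a missing (None) row as an empty list.
    return sum(key in d for d in (rows or []))


A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]
print(count_key(A, "a"))  # 2
```

In PySpark, one option would be to wrap a one-argument version such as lambda rows: count_key(rows, "a") with pyspark.sql.functions.udf and an IntegerType return type, then apply it to column A. That assumes the column is an array of maps, which the question does not state.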

Old Answer: I'm not sure this provides a solution like "map", but you can use a list comprehension, which is fairly straightforward:

A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

A_sum = sum([int(x['a']) for x in A])
print(A_sum)

Output:

3

Explanation:

Essentially, we collect the value stored under your given key 'a' from each dictionary, parse that value to an integer with int, and then use sum to add up all the resulting values in that list. Some good reference material is on W3Schools.
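The sum above raises a KeyError if any dictionary lacks the key. A small sketch of a more tolerant variant follows; the helper name sum_key is hypothetical, and it assumes every stored value is a string that int can parse.

```python
def sum_key(rows, key):
    """Sum the integer values stored under `key`, skipping dicts without it."""
    # The `if key in d` filter avoids a KeyError for dictionaries missing the key.
    return sum(int(d[key]) for d in rows if key in d)


A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]
print(sum_key(A, "a"))  # 3
```

A generator expression is used here instead of a list, so sum consumes the values without building an intermediate list.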
