Reputation: 1
Assume that I have a column A in which every row is a list that contains:
[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
How do I count the number of "a"s?
I would like a solution like F.map().
Many thanks
Upvotes: 0
Views: 909
Reputation: 5032
You can use a UDF to achieve this, assuming each row is, as you mentioned, a list of dictionaries:
from functools import partial

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one
spark = SparkSession.builder.getOrCreate()
temp_df = spark.createDataFrame(
    [
        [[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]],
        [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
        [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
    ],
    ["A"]
)
def key_occurence(inp, key=None):
    # Count how many dictionaries in the list contain the given key
    res = 0
    for d in inp:
        if key in d:
            res += 1
    return res
partial_func = partial(key_occurence, key="a")
key_occurence_udf = F.udf(partial_func,"int")
temp_df = temp_df.withColumn("A_occurence",key_occurence_udf("A"))
temp_df.show()
+--------------------+-----------+
|                   A|A_occurence|
+--------------------+-----------+
|[[a -> 1, b -> 2,...|          2|
|[[a -> 10, b -> 2...|          2|
|[[a -> 10, b -> 2...|          2|
+--------------------+-----------+
The UDF additionally takes a key argument (bound here with partial) that specifies which key to check for.
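The counting logic can be checked without a Spark session. Below is a minimal sketch in plain Python that mirrors key_occurence from this answer and shows how partial fixes the key argument, so the resulting callable takes only the row value. The name count_a and the sample rows are illustrative assumptions, not part of the original code.

```python
from functools import partial


def key_occurence(inp, key=None):
    """Count how many dictionaries in the list `inp` contain `key`."""
    res = 0
    for d in inp:
        if key in d:
            res += 1
    return res


# Bind key="a" so the resulting callable takes only the list of dictionaries.
count_a = partial(key_occurence, key="a")

# Illustrative rows (assumed data, not taken from a real table).
row_1 = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]
row_2 = [{"b": "2"}, {"a": "20", "c": "7"}]

print(count_a(row_1))  # 2
print(count_a(row_2))  # 1
```

Binding a different key, for example partial(key_occurence, key="b"), gives a second callable without touching the function body.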
Upvotes: 0
Reputation: 417
Edited Answer:
Adjusting based on a comment from the OP. To count the occurrences of a particular key in a list of dictionaries, you can still use a list comprehension (with a few adjustments):
A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
A_count = len([y for x in A for y in x if y == 'a'])
print(A_count)
Output:
2
We're essentially using the same logic, just in this case we're using a nested list comprehension. x first iterates through A (the dictionaries), and y iterates through x (specifically, the keys in each dictionary). Finally, we use an if condition to make sure the key matches the specified value.
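The same count can also be sketched with one membership test per dictionary instead of a loop over every key. This is an alternative formulation rather than the code above, and count_key is a hypothetical helper name.

```python
def count_key(dicts, key):
    """Return how many dictionaries in `dicts` have `key` among their keys."""
    # `key in d` is a single dictionary lookup, so each dictionary is
    # checked once instead of scanning all of its keys.
    return sum(1 for d in dicts if key in d)


A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]

# Nested comprehension from the answer, for comparison.
nested_count = len([y for x in A for y in x if y == "a"])

print(count_key(A, "a"))  # 2
print(nested_count)       # 2
```

Both forms agree because a dictionary can hold a given key at most once, so counting matching keys and counting dictionaries that contain the key give the same number.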
Old Answer: I'm not really sure this provides a solution like "map", but you can use a list comprehension, which is fairly straightforward:
A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
A_sum = sum([int(x['a']) for x in A])
print(A_sum)
Output:
3
Explanation:
Essentially we are collecting the dictionary values for your given key of 'a', parsing each value to an integer, and then using sum to add all the resulting values in that list. Some good reference material is on W3Schools.
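If some dictionaries might lack the key, x['a'] would raise a KeyError. A minimal sketch of a guarded version follows, assuming every stored value is a string that int() can parse; sum_key is a hypothetical helper name.

```python
def sum_key(dicts, key):
    """Sum the integer values stored under `key`, skipping dictionaries without it."""
    return sum(int(d[key]) for d in dicts if key in d)


A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]

print(sum_key(A, "a"))                  # 3
print(sum_key(A + [{"b": "9"}], "a"))   # 3, the extra dictionary has no "a"
```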
Upvotes: 1