a meme
a meme

Reputation: 11

Questions on how to count frequency of keyword in pandas dataframes

Suppose I have a Dataframe where some entries are arrays/lists/sets:

Row 0 [a, d, g, h, j] 
Row 1 [d, j, a, p]    
Row 2 [c, f, t, a]    
Row 3 [a, h]
Row 4 []
Row 5 [d, j]

I want to transform it to a df where we have array entries mapped to the total frequency of their appearances (in that column only):

a c d f g h j p
4 1 3 1 1 2 3 1

How would I do that? (need it for bar plots).

Now suppose dataframe has another column:

Row 0 [a, d, g, h, j] 100
Row 1 [d, j, a, p]    300
Row 2 [c, f, t, a]    200
Row 3 [a, h]          160
Row 4 []               40    
Row 5 [d, j]          600

If I wanted to get the average value of col 2 per array entry, how would I do it? to clarify, I want:

a (100 + 300 + 200 + 160)/4 = 760/4 = 190
c 200/1 = 200
d (100 + 300)/2 = 200
f 200
... etc

Thanks, am learning pandas right now and it's really fun but also not quite trivial.

Upvotes: 0

Views: 77

Answers (2)

rahlf23
rahlf23

Reputation: 9019

First, let's recreate your dataframe:

import pandas as pd

df = pd.DataFrame({
    'Rows': ['Row 0','Row 1','Row 2','Row 3','Row 4','Row 5'],
    'Letters': [['a', 'd', 'g', 'h', 'j'], ['d', 'j', 'a', 'p'], ['c', 'f', 't', 'a'], ['a', 'h'], [], ['d', 'j']],
    'Values': [100, 300, 200, 160, 40, 600]
})

Yields:

    Rows          Letters  Values
0  Row 0  [a, d, g, h, j]     100
1  Row 1     [d, j, a, p]     300
2  Row 2     [c, f, t, a]     200
3  Row 3           [a, h]     160
4  Row 4               []      40
5  Row 5           [d, j]     600

Now, we can use explode() to convert your list-type column to individual rows and solve your first question:

df['Letters'].explode().value_counts().sort_index()

Yields:

a    4
c    1
d    3
f    1
g    1
h    2
j    3
p    1
t    1

For the second part, I would refer to @GrzegorzSkibinski's answer, which is how I would approach the problem. His solution efficiently combines both questions in your post:

df.explode('Letters').groupby('Letters').agg({'Values': 'mean'})

Yields:

             Values
Letters            
a        190.000000
c        200.000000
d        333.333333
f        200.000000
g        100.000000
h        130.000000
j        333.333333
p        300.000000
t        200.000000

Upvotes: 1

Georgina Skibinski
Georgina Skibinski

Reputation: 13387

With the inferred setup:

Setup:

data=[
["Row 0", list("adghj"), 100],
["Row 1", list("djap"), 300],
["Row 2", list("cfta"), 200],
["Row 3", list("ah"), 160],
["Row 4", [], 40],
["Row 5", list("dj"), 600]
]

import pandas as pd

df=pd.DataFrame(data, columns=["id", "wrds", "val"])

Solution for both questions:

df=df.explode("wrds")[["wrds", "val"]].groupby("wrds").agg(["count", "mean"])

Outputs:

       val
     count        mean
wrds
a        4  190.000000
c        1  200.000000
d        3  333.333333
f        1  200.000000
g        1  100.000000
h        2  130.000000
j        3  333.333333
p        1  300.000000
t        1  200.000000

Upvotes: 2

Related Questions