ltyrvol
ltyrvol

Reputation: 121

Pandas how add a new column to dataframe based on values from all rows, specific columns values applied to whole dataframe

I working on a pandas DataFrame which needs a new column that shows count of specific values in specific columns.

I tried various combinations groupby and pivot, but had problems to apply it to whole dataframe without errors.

df = pd.DataFrame([
    ['a', 'z'],
    ['a', 'x'],
    ['a', 'y'],
    ['b', 'v'],
    ['b', 'x'],
    ['b', 'v']],
  columns=['col1', 'col2'])

I need to add col3 that counts 'v' values in col2 for each value in 'col1'. There is no 'v' in col2 for 'a' in col1, so it's 0 everywhere, while expected value count is 2 for 'b', also in a row where value in col2 equals 'x' instead of 'v'.

Expected output:

['a', 'z', 0]
['a', 'x', 0]
['a', 'y', 0]
['b', 'v', 2]
['b', 'x', 2]
['b', 'v', 2]

I'm looking rather for a nice pandas specific solution because the original dataframe is quite big, so things like row iterations and time expensive.

Upvotes: 2

Views: 541

Answers (3)

Mark Wang
Mark Wang

Reputation: 2757

All the answers above are fine. The only caveat is that transform can be slow when the group size is very large. Alternatively, you can try the workaround below,

(df.assign(mask = lambda x:x.col2.eq('v'))
   .pipe(lambda x:x.join(x.groupby('col1')['mask'].sum().map(int).rename('col3'),on='col1')))

Upvotes: 0

Ankur Sinha
Ankur Sinha

Reputation: 6669

While ALollz's answer is neat and a one liner, here is another one although a two step solution introducing you to other concepts like str.contains and np.where!

First get the rows which have v using np.where and mark them as a flag:

df['col3'] = np.where(df['col2'].str.contains('v'), 1, 0)

Now perform a groupby on col1 and sum them:

df['col3'] = df.groupby('col1')['col3'].transform('sum')

Output:

  col1 col2  col3
0    a    z     0
1    a    x     0
2    a    y     0
3    b    v     2
4    b    x     2
5    b    v     2

Upvotes: 0

ALollz
ALollz

Reputation: 59579

Create a Boolean Series checking the equality then groupby +transform + sum to count them.

df['col3'] = df.col2.eq('v').astype(int).groupby(df.col1).transform('sum')  

#  col1 col2  col3
#0    a    z     0
#1    a    x     0
#2    a    y     0
#3    b    v     2
#4    b    x     2
#5    b    v     2

Upvotes: 1

Related Questions