Reputation: 511

Groupby pandas but perform calculation on multiple columns

I have a dataframe like below:

name	date	col1	col2
A	2021-03-01	0	1
A	2021-03-02	0	0
A	2021-03-03	3	1
A	2021-03-04	1	0
A	2021-03-05	3	1
A	2021-03-06	1	0
B	2021-03-01	1	0
B	2021-03-02	2	0
B	2021-03-03	3	1
B	2021-03-04	0	1
B	2021-03-05	0	0
B	2021-03-06	0	0

I'd like to group by the names and find the number of days spanned by the nonzero entries of the other non-date columns (basically excluding any leading or trailing zeroes) to get something like:

name	col1	col2
A	4	5
B	3	2

How can I do this without resorting to a for loop?

Upvotes: 4

Answers (3)

ALollz

Reputation: 59549

Here's a method that might scale a lot better for many groups [it avoids the apply(lambda x:) in favor of multiple groubpy builtin operations.]

Check for values !=0 then take a cummax within group in both directions to label the spans with True. Then groupby + sum to get the size.

cols = ['col1', 'col2']
df1 = df[cols].ne(0)

df1 = df1.groupby(df['name']).cummax() & df1[::-1].groupby(df['name']).cummax()
df1 = df1.groupby(df['name']).sum()

print(df1)

      col1  col2
name            
A        4     5
B        3     2

Upvotes: 0

Andrej Kesely

Reputation: 195448

Another solution:

def fn(x):
    i = x[(x != 0)].index
    return i[-1] - i[0] + 1 if len(i) > 0 else 0


print(df.groupby("name").agg({"col1": fn, "col2": fn}))

Prints:

      col1  col2
name            
A        4     5
B        3     2

Upvotes: 1

Stefan B

Reputation: 1677

I think, np.trim_zeros is what you are looking for:

>>> import numpy as np; import pandas as pd
>>> df = pd.DataFrame.from_dict({'name': ['A']*6 + ['B']*6, 'col1': [0, 0, 3, 1, 3, 1, 1, 2, 3, 0, 0, 0], 'col2': [1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0]})

>>> df
   name  col1  col2
0     A     0     1
1     A     0     0
2     A     3     1
3     A     1     0
4     A     3     1
5     A     1     0
6     B     1     0
7     B     2     0
8     B     3     1
9     B     0     1
10    B     0     0
11    B     0     0

>>> df.groupby('name').aggregate(lambda x: len(np.trim_zeros(x))).reset_index()
  name  col1  col2
0    A     4     5
1    B     3     2

Upvotes: 4

Groupby pandas but perform calculation on multiple columns

Answers (3)

Related Questions