Pineapple P
Pineapple P

Reputation: 83

How to filter elements in pandas group within a range

I have the following pandas DataFrame:

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'x': ['A', 'A', 'A', 'A', 'C', 'C','E', 'G', 'G', 'G'],
                   'y': ['B', 'B', 'B', 'B', 'D', 'D', 'F', 'H', 'H', 'H'],
                   'year': [1990, 1991, 1992, 1993, 1994, 1999, 1999, 2001, 2002, 2010]})

   id  x  y  year
0   1  A  B  1990
1   2  A  B  1991
2   3  A  B  1992
3   4  A  B  1993
4   5  C  D  1994
5   6  C  D  1999
6   7  E  F  1999
7   8  G  H  2001
8   9  G  H  2002
9  10  G  H  2010

for each groupby(['x', 'y']), I need only the count of the ids in which the year figures between the min year of the group and year + 4

Expected result:

   x  y  count_id
0  A  B         4
1  C  D         1
2  E  F         1
3  G  H         2

Upvotes: 1

Views: 683

Answers (5)

Mahmoud Aly
Mahmoud Aly

Reputation: 751

Try this dude..

Here is your input

import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'x': ['A', 'A', 'A', 'A', 'C', 'C','E', 'G', 'G', 'G'],
                   'y': ['B', 'B', 'B', 'B', 'D', 'D', 'F', 'H', 'H', 'H'],
                   'year': [1990, 1991, 1992, 1993, 1994, 1999, 1999, 2001, 2002, 2010]})
df = df.sort_values('year')
groups = df[['x','y']].drop_duplicates().values
print(groups)

And here is the code:

output_groups = []
for group in groups:
  _min_year = None
  _ids = 0
  for row in df.values:
    _group = row[1:3]
    if (_group==group).all():
      _year = row[3]
      if _min_year == None:_min_year = _year
      if _year <= _min_year + 4:_ids+=1
  group = np.append(group,_ids)
  output_groups.append(group)
output_df=pd.DataFrame(output_groups,columns=['x','y','count_id'])  
print(output_df)

Upvotes: 0

anky
anky

Reputation: 75130

Adding another way.

Assuming the column x, y and year is sorted (if not you can use df.sort_values on these columns first) , you can also do a diff , then mask the first values and get sum of each group:

cols = ['x','y']
v = df['year'].diff().mask(df[cols].ne(df[cols].shift()).any(1))
o = df.assign(count_id=~v.gt(4)).groupby(cols,as_index=False)['count_id'].sum()

print(o)

   x  y  count_id
0  A  B         4
1  C  D         1
2  E  F         1
3  G  H         2

Upvotes: 2

Andrej Kesely
Andrej Kesely

Reputation: 195573

Using .groupby + Series.between:

x = (
    df.groupby(["x", "y"])
    .apply(lambda x: x["year"].between((mn := x["year"].min()), mn + 4).sum())
    .reset_index(name="count_id")
)
print(x)

Prints:

   x  y  count_id
0  A  B         4
1  C  D         1
2  E  F         1
3  G  H         2

Upvotes: 3

Henry Ecker
Henry Ecker

Reputation: 35686

We can create a mask based on where the year is less than or equal to 4 years from the group minumum (groupby transform, then filter and groupby to count the number of ids:

cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].count()

Or with nunique if wanting only unique id count:

cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].nunique()

new_df:

   x  y  id
0  A  B   4
1  C  D   1
2  E  F   1
3  G  H   2

Upvotes: 4

Henry Yik
Henry Yik

Reputation: 22503

You can groupby and apply:

print (df.groupby(["x","y"])["year"].apply(lambda d: (d-d.min()).le(4).sum()))

x  y
A  B    4
C  D    1
E  F    1
G  H    2
Name: year, dtype: int64

Upvotes: 3

Related Questions