Reputation: 83
I have the following pandas DataFrame:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'x': ['A', 'A', 'A', 'A', 'C', 'C','E', 'G', 'G', 'G'],
'y': ['B', 'B', 'B', 'B', 'D', 'D', 'F', 'H', 'H', 'H'],
'year': [1990, 1991, 1992, 1993, 1994, 1999, 1999, 2001, 2002, 2010]})
id x y year
0 1 A B 1990
1 2 A B 1991
2 3 A B 1992
3 4 A B 1993
4 5 C D 1994
5 6 C D 1999
6 7 E F 1999
7 8 G H 2001
8 9 G H 2002
9 10 G H 2010
for each groupby(['x', 'y'])
, I need only the count of the ids in which the year figures between the min year of the group and year + 4
Expected result:
x y count_id
0 A B 4
1 C D 1
2 E F 1
3 G H 2
Upvotes: 1
Views: 683
Reputation: 751
Try this dude..
Here is your input
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'x': ['A', 'A', 'A', 'A', 'C', 'C','E', 'G', 'G', 'G'],
'y': ['B', 'B', 'B', 'B', 'D', 'D', 'F', 'H', 'H', 'H'],
'year': [1990, 1991, 1992, 1993, 1994, 1999, 1999, 2001, 2002, 2010]})
df = df.sort_values('year')
groups = df[['x','y']].drop_duplicates().values
print(groups)
And here is the code:
output_groups = []
for group in groups:
_min_year = None
_ids = 0
for row in df.values:
_group = row[1:3]
if (_group==group).all():
_year = row[3]
if _min_year == None:_min_year = _year
if _year <= _min_year + 4:_ids+=1
group = np.append(group,_ids)
output_groups.append(group)
output_df=pd.DataFrame(output_groups,columns=['x','y','count_id'])
print(output_df)
Upvotes: 0
Reputation: 75130
Adding another way.
Assuming the column x
, y
and year
is sorted (if not you can use df.sort_values
on these columns first) , you can also do a diff
, then mask the first values and get sum of each group:
cols = ['x','y']
v = df['year'].diff().mask(df[cols].ne(df[cols].shift()).any(1))
o = df.assign(count_id=~v.gt(4)).groupby(cols,as_index=False)['count_id'].sum()
print(o)
x y count_id
0 A B 4
1 C D 1
2 E F 1
3 G H 2
Upvotes: 2
Reputation: 195573
Using .groupby
+ Series.between
:
x = (
df.groupby(["x", "y"])
.apply(lambda x: x["year"].between((mn := x["year"].min()), mn + 4).sum())
.reset_index(name="count_id")
)
print(x)
Prints:
x y count_id
0 A B 4
1 C D 1
2 E F 1
3 G H 2
Upvotes: 3
Reputation: 35686
We can create a mask based on where the year is less than or equal to 4 years from the group minumum (groupby transform
, then filter and groupby to count
the number of ids:
cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].count()
Or with nunique
if wanting only unique id count:
cols = ['x', 'y']
m = df['year'].sub(df.groupby(cols)['year'].transform('min')).le(4)
new_df = df[m].groupby(cols, as_index=False)['id'].nunique()
new_df
:
x y id
0 A B 4
1 C D 1
2 E F 1
3 G H 2
Upvotes: 4
Reputation: 22503
You can groupby
and apply
:
print (df.groupby(["x","y"])["year"].apply(lambda d: (d-d.min()).le(4).sum()))
x y
A B 4
C D 1
E F 1
G H 2
Name: year, dtype: int64
Upvotes: 3