Reputation: 331
I have a table like this:
In [2]: df = pd.DataFrame({
...: 'donorID':[101,101,101,102,103,101,101,102,103],
...: 'recipientID':[11,11,21,21,31,11,21,31,31],
...: 'amount':[100,200,500,200,200,300,200,200,100],
...: 'year':[2014,2014,2014,2014,2014,2015,2015,2015,2015]
...: })
In [3]: df
Out[3]:
amount donorID recipientID year
0 100 101 11 2014
1 200 101 11 2014
2 500 101 21 2014
3 200 102 21 2014
4 200 103 31 2014
5 300 101 11 2015
6 200 101 21 2015
7 200 102 31 2015
8 100 103 31 2015
I'd like to count, for each donor, the number of donor-recipient pairs with donations in n years (donations made by the same donor to the same recipient; n could be any number and the years don't have to be consecutive, but I use 2 here to keep things simple). In this case, donor 101 donated to recipients 11 and 21 in both 2014 and 2015, so the count for 101 is 2. The count for 102 is 0, and for 103 it is 1. The result table would look like this:
donorID num_donation_2_years
0 101 2
1 102 0
2 103 1
I've tried to use groupby and pivot_table but didn't manage to get the right answer. Any suggestions in pandas would be appreciated. Thanks!
Upvotes: 2
Views: 1564
Reputation: 7994
df_new = df.groupby(["donorID", "recipientID"])["year"].nunique().reset_index(name="year_count")
df_for_query = df_new.groupby(["donorID", "year_count"]).size().reset_index(name='numb_recipient')
donorID year_count numb_recipient
0 101 2 2
1 102 1 2
2 103 2 1
The third column is how many recipients fit the year condition. Line 0 says donor 101 has 2 recipients that he/she donated to in exactly two years. This is not exactly your output, but you can query it easily from this df.
If you want to find the number of recipients a donor donated to over some number of years, say 2, run
df_for_query.query("year_count == 2")
donorID year_count numb_recipient
0 101 2 2
2 103 2 1
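If you need the exact table from the question (including donor 102 with a count of 0), one way to get there from df_new is sketched below; the reindex over all donors fills in the zeros, and the column name num_donation_2_years is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'donorID': [101, 101, 101, 102, 103, 101, 101, 102, 103],
    'recipientID': [11, 11, 21, 21, 31, 11, 21, 31, 31],
    'amount': [100, 200, 500, 200, 200, 300, 200, 200, 100],
    'year': [2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
})

# per-pair count of distinct years, as above
df_new = (df.groupby(["donorID", "recipientID"])["year"]
            .nunique().reset_index(name="year_count"))

# keep pairs spanning >= 2 years, count them per donor,
# and reindex over all donors so 102 appears with 0
result = (df_new.query("year_count >= 2")
                .groupby("donorID").size()
                .reindex(df["donorID"].unique(), fill_value=0)
                .rename("num_donation_2_years")
                .reset_index())
```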
Thanks to Wen for the inspiration to use nunique!
Upvotes: 1
Reputation: 24565
The following code works (explanation as comments; ol for outlist). Note: tuple keys are used instead of string concatenation, which would collide for multi-digit IDs, and duplicate (donor, recipient, year) rows are dropped so only distinct years are counted:
# one row per distinct donor-recipient-year combination
pairs = df.drop_duplicates(['donorID', 'recipientID', 'year'])
# count distinct years per donor-recipient pair
ol = pd.value_counts(list(zip(pairs.donorID, pairs.recipientID)))
ol = ol[ol >= 2]  # choose only those >= 2
ol.index = [d for d, r in ol.index]  # get donorID back from the pair
print(pd.value_counts(ol.index))  # print desired frequency
Output:
101 2
103 1
dtype: int64
Upvotes: 0
Reputation: 323306
Something like
df1=df.groupby('donorID').apply(lambda x : x.groupby(x.recipientID).year.nunique().gt(1).sum())
df1
Out[102]:
donorID
101 2
102 0
103 1
dtype: int64
To get the dataframe
df1.to_frame('num_donation_2_years').reset_index()
Out[104]:
donorID num_donation_2_years
0 101 2
1 102 0
2 103 1
As Dark mentions, avoid using apply. This is the update:
df1=df.groupby(['donorID','recipientID']).year.nunique().gt(1).sum(level=0)
df1
Out[109]:
donorID
101 2.0
102 0.0
103 1.0
Name: year, dtype: float64
df1.to_frame('num_donation_2_years').reset_index()
Out[104]:
donorID num_donation_2_years
0 101 2
1 102 0
2 103 1
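Since the question says n could be any number, the same idea generalizes by swapping gt(1) for ge(n). A sketch, using groupby(level=0).sum() rather than sum(level=0) (recent pandas versions removed the level argument), with n as an assumed parameter name:

```python
import pandas as pd

df = pd.DataFrame({
    'donorID': [101, 101, 101, 102, 103, 101, 101, 102, 103],
    'recipientID': [11, 11, 21, 21, 31, 11, 21, 31, 31],
    'amount': [100, 200, 500, 200, 200, 300, 200, 200, 100],
    'year': [2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
})

n = 2  # minimum number of distinct years; any value works
df1 = (df.groupby(['donorID', 'recipientID'])['year']
         .nunique()                # distinct years per donor-recipient pair
         .ge(n)                    # does the pair span at least n years?
         .groupby(level=0).sum()   # count qualifying pairs per donor
         .astype(int))
```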
Upvotes: 3
Reputation: 30605
An improvement to @Wen's solution, avoiding apply for more speed:
one = df.groupby(['donorID','recipientID'])['year'].nunique().gt(1)
two = one.groupby(level=0).sum().to_frame('no_of_donations_2_years').reset_index()
donorID no_of_donations_2_years
0 101 2.0
1 102 0.0
2 103 1.0
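The counts come out as floats (2.0, 0.0, ...) because the summed booleans get upcast. If the integer dtype from the question's expected output matters, a cast fixes it; a sketch of the same pipeline with astype(int) added:

```python
import pandas as pd

df = pd.DataFrame({
    'donorID': [101, 101, 101, 102, 103, 101, 101, 102, 103],
    'recipientID': [11, 11, 21, 21, 31, 11, 21, 31, 31],
    'amount': [100, 200, 500, 200, 200, 300, 200, 200, 100],
    'year': [2014, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
})

one = df.groupby(['donorID', 'recipientID'])['year'].nunique().gt(1)
two = (one.groupby(level=0).sum()
          .astype(int)  # cast the summed booleans back to int
          .to_frame('no_of_donations_2_years').reset_index())
```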
Upvotes: 3