Reputation: 105
I have a dataframe that looks like this
id    | date       | year | cohort
17482 | 2017-01-01 | 2017 | tier 1
17482 | 2017-01-01 | 2017 | tier 2
17483 | 2017-01-01 | 2017 | tier 2
17483 | 2017-01-01 | 2017 | tier 2
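For reference, a small frame matching this structure can be built like so (a sketch; the values are just the rows shown above):

import pandas as pd

# minimal example frame matching the table above
df = pd.DataFrame({
    'id': [17482, 17482, 17483, 17483],
    'date': pd.to_datetime(['2017-01-01'] * 4),
    'year': [2017, 2017, 2017, 2017],
    'cohort': ['tier 1', 'tier 2', 'tier 2', 'tier 2'],
})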
I want to group by year and cohort and find the number of unique ids for each cohort per year. I use the following code
df.groupby(['year', 'cohort'])['id'].nunique().reset_index()
The output of the code is
year | cohort | id
2017 | tier 1 | 400
2017 | tier 2 | 600
However, there are only 500 unique ids in the year 2017, so some ids must be counted in both cohorts. Therefore I want to find the IDs that are in both tier 1 and tier 2.
Upvotes: 1
Views: 302
Reputation: 8219
We do a pivot on the cohort column, indexed by id and year, counting how many times each id appears in each tier:
df2 = pd.pivot_table(df.assign(count=1), index=['id', 'year'], columns='cohort', values='count', aggfunc='count')
this produces
cohort tier 1 tier 2
id year
17482 2017 1.0 1.0
17483 2017 NaN 2.0
Now it is just a matter of reporting rows where both tier 1 and tier 2 are non-NaN:
df2[(~df2['tier 1'].isna()) & (~df2['tier 2'].isna())]
output
cohort tier 1 tier 2
id year
17482 2017 1.0 1.0
Then you can count them however you like.
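For example, to count the overlapping ids per year from that filtered frame (a sketch building on the df2 above):

both = df2[df2['tier 1'].notna() & df2['tier 2'].notna()]

# number of ids that appear in both tiers, per year
both.reset_index().groupby('year')['id'].nunique()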
Upvotes: 0
Reputation: 524
From your comment
I just need a count of ids that have a row with cohort = tier 1 AND cohort = tier 2 in a given year. So in the example above the only ID 17482 is in both cohorts for 2017
Is this what you wanted?
df[df['cohort'].isin(['tier 1','tier 2'])].groupby(['year','id']).agg('count')
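If the goal is specifically ids that appear in both tiers within a year, one variant of this groupby is to count distinct cohorts per id and year and keep the ids where that count is 2 (a sketch):

# distinct cohorts each id appears in, per year
cohorts_per_id = (
    df[df['cohort'].isin(['tier 1', 'tier 2'])]
    .groupby(['year', 'id'])['cohort']
    .nunique()
)

# ids that show up in both tiers
cohorts_per_id[cohorts_per_id == 2]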
Upvotes: 0
Reputation: 108
Solution
# one row per unique (id, year, cohort) combination, with its row count
data_year_cohort = df.groupby(["id", "year", "cohort"])["id"].count().reset_index(name="individual")

# number of distinct cohorts each id falls into per year
count_unique = data_year_cohort.groupby(["id", "year"])["cohort"].count().reset_index(name="cohorts_in_year")

# ids that appear in both cohorts for a given year
request_matching_ids = count_unique.id.loc[count_unique.cohorts_in_year == 2]
len(request_matching_ids)
Explanation
There are two parts to the process.
Firstly, you want to group the data by ID, year and cohort to find each unique combination of those three.
Once you have that, the second line of code counts these unique combinations for each ID and year. As there are two cohorts, a count of 2 means the ID has at least one 'tier 1' row and at least one 'tier 2' row for that particular year.
You can then filter the result to show only the IDs with a count of 2, which are the ones you're after based on your request.
And finally, as you specified in your follow-up, you want a count of those IDs, which is as simple as calling len() on the result of the third line of code.
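As a quick sanity check against the sample frame from the question, the pipeline behaves roughly like this:

# on the question's sample data:
#   data_year_cohort has one row per (id, year, cohort) combination
#   count_unique shows 17482 with cohorts_in_year == 2 and 17483 with 1
#   so request_matching_ids contains only 17482
len(request_matching_ids)  # 1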
Upvotes: 0
Reputation: 13831
Something like this?
# rows per (year, cohort, id) combination
df['new_col'] = df.groupby(['year', 'cohort', 'id'])['date'].transform('count')
Upvotes: 0
Reputation: 26676
The question is not clear. Please try this and let us know; happy to help further.
df.groupby(['id','year','cohort']).size()
Upvotes: 1