dsexplorer

Reputation: 105

Group by columns and find IDs in both groups

I have a dataframe that looks like this

id   | date      | year| cohort
17482| 2017-01-01| 2017| tier 1 
17482| 2017-01-01| 2017| tier 2
17483| 2017-01-01| 2017| tier 2 
17483| 2017-01-01| 2017| tier 2

I want to group by year and cohort and find the number of unique ids for each cohort per year. I use the following code

df.groupby(['year', 'cohort'])['id'].nunique().reset_index()

The output of the code is

year | cohort | id
2017 | tier 1 | 400
2017 | tier 2 | 600

However, there are only 500 unique ids in the year 2017, so some ids must appear in more than one cohort. I therefore want to find the IDs that are in both tier 1 and tier 2.

Upvotes: 1

Views: 302

Answers (5)

piterbarg

Reputation: 8219

We first pivot on the cohort column, indexed by id and year, counting how many times each id appears in each tier:

df2 = pd.pivot_table(df.assign(count=1), index=['id', 'year'], columns='cohort', values='count', aggfunc='count')

this produces


cohort          tier 1  tier 2
id      year        
17482   2017    1.0     1.0
17483   2017    NaN     2.0

Now it is just a matter of reporting rows where both tier 1 and tier 2 are non-NaN:

df2[df2['tier 1'].notna() & df2['tier 2'].notna()]

output

    cohort      tier 1  tier 2
id      year        
17482   2017    1.0     1.0

then you can count them however you like
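For example, running this end-to-end on the four sample rows from the question (a minimal sketch; the data below is just those rows, not the asker's full frame), the count falls out of len():

```python
import pandas as pd

# the four sample rows from the question
df = pd.DataFrame({
    'id':     [17482, 17482, 17483, 17483],
    'date':   ['2017-01-01'] * 4,
    'year':   [2017] * 4,
    'cohort': ['tier 1', 'tier 2', 'tier 2', 'tier 2'],
})

# pivot: one row per (id, year), one column per cohort
df2 = pd.pivot_table(df.assign(count=1), index=['id', 'year'],
                     columns='cohort', values='count', aggfunc='count')

# keep the rows where both tiers are present, then count them
both = df2[df2['tier 1'].notna() & df2['tier 2'].notna()]
print(len(both))
print(both.index.get_level_values('id').tolist())
```

On this sample only id 17482 survives the filter, which matches the expectation stated in the question.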

Upvotes: 0

d-xa

Reputation: 524

From your comment

I just need a count of ids that have a row with cohort = tier 1 AND cohort = tier 2 in a given year. So in the example above the only ID 17482 is in both cohorts for 2017

is this what you wanted?

df[df['cohort'].isin(['tier 1','tier 2'])].groupby(['year','id']).agg('count')
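As written, this counts rows per id/year but does not by itself isolate the ids present in both tiers. One way to get the count the comment asks for (a sketch of the same idea, not necessarily what the answerer intended) is to count distinct cohorts per id/year and keep those equal to 2:

```python
import pandas as pd

# the four sample rows from the question
df = pd.DataFrame({
    'id':     [17482, 17482, 17483, 17483],
    'year':   [2017] * 4,
    'cohort': ['tier 1', 'tier 2', 'tier 2', 'tier 2'],
})

# number of distinct cohorts per (year, id)
n_cohorts = df.groupby(['year', 'id'])['cohort'].nunique()

# ids that appear in both tiers in a given year
in_both = n_cohorts[n_cohorts == 2]
print(in_both.index.get_level_values('id').tolist())
```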

Upvotes: 0

erskyne

Reputation: 108

Solution

data_year_cohort = df.groupby(["id", "year", "cohort"])["id"].count().reset_index(name="individual")
count_unique = data_year_cohort.groupby(["id", "year"])["cohort"].count().reset_index(name="cohorts_in_year")
request_matching_ids = count_unique.id.loc[count_unique.cohorts_in_year == 2]
len(request_matching_ids)

Explanation

There are two parts to the process.

Firstly, you want to group the data by ID, year and cohort to find each unique combination of those three.

Once you have that (second line of code), you can count these unique combinations for each ID and year. As there are two cohorts, a count of 2 means the ID has at least one 'tier 1' row and at least one 'tier 2' row for that year.

You can then filter the result of the second step to keep only the IDs with a count of 2, which are the ones you're after.

And finally, as you specified in your follow-up, you want a count of those IDs, which is as simple as calling len() on the series created in the third line of code.
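Running the three steps on the question's sample rows (a quick check using only the four rows shown above) gives a count of 1, since 17482 is the only id present in both cohorts:

```python
import pandas as pd

# the four sample rows from the question
df = pd.DataFrame({
    'id':     [17482, 17482, 17483, 17483],
    'year':   [2017] * 4,
    'cohort': ['tier 1', 'tier 2', 'tier 2', 'tier 2'],
})

# step 1: one row per unique (id, year, cohort) combination
data_year_cohort = df.groupby(['id', 'year', 'cohort'])['id'].count().reset_index(name='individual')

# step 2: count how many cohorts each (id, year) pair appears in
count_unique = data_year_cohort.groupby(['id', 'year'])['cohort'].count().reset_index(name='cohorts_in_year')

# step 3: keep the ids seen in both cohorts, then count them
request_matching_ids = count_unique.id.loc[count_unique.cohorts_in_year == 2]
print(len(request_matching_ids))
```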

Upvotes: 0

sophocles

Reputation: 13831

Something like this?

df['new_col'] = df.groupby(['year', 'cohort', 'id'])['id'].transform('count')
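To connect this to the question: transform can also tag each row with the number of distinct cohorts its id belongs to in that year (a sketch; the column name n_cohorts is made up here), after which the ids in both tiers fall out of a simple filter:

```python
import pandas as pd

# the four sample rows from the question
df = pd.DataFrame({
    'id':     [17482, 17482, 17483, 17483],
    'year':   [2017] * 4,
    'cohort': ['tier 1', 'tier 2', 'tier 2', 'tier 2'],
})

# number of distinct cohorts for each row's (year, id) pair
df['n_cohorts'] = df.groupby(['year', 'id'])['cohort'].transform('nunique')

# unique ids present in both tiers
ids_in_both = df.loc[df['n_cohorts'] == 2, 'id'].unique().tolist()
print(ids_in_both)
```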

Upvotes: 0

wwnde

Reputation: 26676

Question not clear. Please try this and let us know; happy to help further.

df.groupby(['id','year','cohort']).size()
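On the sample rows from the question, size() yields one count per (id, year, cohort) combination, which at least exposes the duplicated tier 2 rows (a quick illustration, not a full answer to the follow-up):

```python
import pandas as pd

# the four sample rows from the question
df = pd.DataFrame({
    'id':     [17482, 17482, 17483, 17483],
    'year':   [2017] * 4,
    'cohort': ['tier 1', 'tier 2', 'tier 2', 'tier 2'],
})

# row count per (id, year, cohort) combination
sizes = df.groupby(['id', 'year', 'cohort']).size()
print(sizes)
```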

Upvotes: 1
