python_noob

Reputation: 285

Groupby on two columns with bins(ranges) on one of them in Pandas Dataframe

I am trying to segregate my data into buckets based on certain user attributes, and I would like to see counts for each bucket. For this I have imported the data into a Pandas DataFrame.

I have data that has user city, kids age and their unique id. I would like to know the count of users who reside in city A and have kids in age group 0-5.

Sample Data frame looks something like this:

city  kids_age  user_id
A         10       1  
B          4       2
A          4       3        
C          8       4
A          3       5 

Expected Output:

city   bin   count
A      0-5      2 
       5-10     1

B      0-5      1
       5-10     0

C      0-5      0
       5-10     1

I tried group by on two columns city and kids age:

user_details_df_cropped_1.groupby(['city', 'kids_age']).count()

It gave me an output that looks something like this:

city  kids_age  user_id   count
 A      10       1          1
         4       3          1
         3       5          1
 B       4       2          1 
 C       8       4          1

It returns the users grouped by city, but not by kids age bins (ranges). What am I missing here? Appreciate the help!

Upvotes: 2

Views: 2626

Answers (1)

jezrael

Reputation: 863801

Use cut for binning, pass the result to DataFrame.groupby, add the missing combinations as 0 rows with Series.unstack and DataFrame.stack, and finally convert to a DataFrame with Series.reset_index:

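import pandas as pd

bins = [0, 5, 10]
# Build the labels '0-5' and '5-10' from consecutive bin edges
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
# Bin the ages; include_lowest=True puts age 0 into the first bin
b = pd.cut(df['kids_age'], bins=bins, labels=labels, include_lowest=True)

# Count per (city, bin); unstack/stack adds the missing combinations as 0
df = df.groupby(['city', b]).size().unstack(fill_value=0).stack().reset_index(name='count')
print(df)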
  city kids_age  count
0    A      0-5      2
1    A     5-10      1
2    B      0-5      1
3    B     5-10      0
4    C      0-5      0
5    C     5-10      1
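
To see what the binning step does on its own, here is a minimal sketch that rebuilds the sample data from the question and prints the bin assigned to each row. The DataFrame construction and the `binned` / `edges` names are only for illustration. By default cut closes each interval on the right, so an age of exactly 5 lands in 0-5 and an age of 10 lands in 5-10:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'city': ['A', 'B', 'A', 'C', 'A'],
    'kids_age': [10, 4, 4, 8, 3],
    'user_id': [1, 2, 3, 4, 5],
})

bins = [0, 5, 10]
labels = ['0-5', '5-10']

# Intervals are closed on the right by default: (0, 5] and (5, 10].
# include_lowest=True also closes the first interval on the left: [0, 5].
b = pd.cut(df['kids_age'], bins=bins, labels=labels, include_lowest=True)

# One label per input row, in the original row order
binned = b.astype(str).tolist()
print(binned)  # ['5-10', '0-5', '0-5', '5-10', '0-5']

# Values that sit exactly on a bin edge
edges = pd.cut(pd.Series([0, 5, 10]), bins=bins, labels=labels,
               include_lowest=True).astype(str).tolist()
print(edges)  # ['0-5', '0-5', '5-10']
```

If you want left-closed bins instead (5 counted in 5-10), cut has a right=False parameter, but then the labels and the upper edge need to be adjusted to match.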

Another solution with Series.reindex and MultiIndex.from_product to add the missing rows filled with 0:

# df here is the original input DataFrame, not the result printed above
bins = [0, 5, 10]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
b = pd.cut(df['kids_age'], bins=bins, labels=labels, include_lowest=True)
# Every (city, bin) combination, including those with no rows
mux = pd.MultiIndex.from_product([df['city'].unique(), labels], names=['city','kids_age'])

df = (df.groupby(['city', b])
        .size()
        .reindex(mux, fill_value=0)
        .reset_index(name='count'))
print(df)
  city kids_age  count
0    A      0-5      2
1    A     5-10      1
2    B      0-5      1
3    B     5-10      0
4    C      0-5      0
5    C     5-10      1
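
For a version you can paste and run, here is a rough sketch of the reindex idea on the sample data from the question. The function name `count_by_city_and_bin` is hypothetical, and casting the bins to plain strings before grouping is my own assumption rather than something the snippet above does: it keeps categoricals out of the group keys, so the result should not depend on the `observed` default of groupby, which differs between pandas versions.

```python
import pandas as pd


def count_by_city_and_bin(df, bins):
    """Count rows per (city, age bin), keeping empty combinations as 0.

    Hypothetical helper; assumes columns named 'city' and 'kids_age'.
    """
    labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
    b = pd.cut(df['kids_age'], bins=bins, labels=labels, include_lowest=True)

    # Full grid of (city, bin) pairs, so missing pairs can be filled with 0
    mux = pd.MultiIndex.from_product(
        [sorted(df['city'].unique()), labels], names=['city', 'kids_age'])

    return (df.assign(kids_age=b.astype(str))   # bins as plain strings
              .groupby(['city', 'kids_age'])
              .size()
              .reindex(mux, fill_value=0)
              .reset_index(name='count'))


# Sample data copied from the question
sample = pd.DataFrame({
    'city': ['A', 'B', 'A', 'C', 'A'],
    'kids_age': [10, 4, 4, 8, 3],
    'user_id': [1, 2, 3, 4, 5],
})

result = count_by_city_and_bin(sample, [0, 5, 10])
print(result)
```

Note that ages falling outside the bins (or missing ages) are not counted in any bin here, because reindex only keeps the pairs listed in mux.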

Upvotes: 2
