Pryderide

Reputation: 153

Pandas 're-binning' a DataFrame

I have a DataFrame containing data with age binned in separate rows, as below:

 VALUE,AGE
 10,   0-4
 20,   5-9
 30,  10-14
 40,  15-19
 .. ..   .....

So, basically, the age is grouped in 5-year bins. I'd like to have 10-year bins instead, that is, 0-9, 10-19, etc. What I'm after is the VALUE summed over 10-year age bins, so the result would be:

VALUE,AGE
30,  0-9
70, 10-19

I can do it by shifting and adding, then taking every second row of the resulting DataFrame, but is there a smarter, more general way built into Pandas to do this?

Upvotes: 0

Views: 1442

Answers (2)

Tom

Reputation: 8790

Here's a "dumb" version, based on this answer - just sum every 2 rows:

In[0]
df.groupby(df.index // 2).sum()

Out[0]:
   VALUE
0     30
1     70

I say "dumb" because this method doesn't look at the age cut-offs at all; it just happens to align with them. If the age ranges vary in width, or if the data start at 5-9 instead of 0-4, rows from different 10-year bins will be summed together. The resulting index (0, 1, ...) also carries no age information, so you have to relabel it yourself.
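
As a middle ground, the group key can be derived from each row's lower age bound instead of its position. This is only a sketch, assuming every label has the form 'lower-upper' and that the 10-year bins start at multiples of 10; the names lower, start and labels are made up for illustration:

```python
import pandas as pd

# Sample that starts at 5-9, so pairing rows by position would misalign
df = pd.DataFrame({'VALUE': [20, 30, 40],
                   'AGE': ['5-9', '10-14', '15-19']})

# Lower bound of each 5-year bin, as an integer
lower = df['AGE'].str.split('-').str[0].astype(int)

# Start of the 10-year bin each row falls into (0, 10, 20, ...)
start = lower // 10 * 10

# Readable labels such as '0-9' and '10-19'
labels = (start.astype(str) + '-' + (start + 9).astype(str)).rename('AGE')

# sort=False keeps the bins in order of appearance rather than sorting the strings
out = df.groupby(labels, sort=False)['VALUE'].sum().reset_index()
print(out)
```

Here the lone 5-9 row ends up in the 0-9 bin by itself (VALUE 20), and the other two rows are summed into 10-19 (VALUE 70).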

A "smarter" version would be to actually create bins with pd.cut and use that to group the data, based on the ages for each row:

In[0]
df['MAX_AGE'] = df['AGE'].str.split('-').str[-1].astype(int)

bins = [0,10,20]
out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False))[['VALUE']].sum()

Out[0]:
          VALUE
MAX_AGE        
[0, 10)      30
[10, 20)     70

Explanation:

  • Use pandas.Series.str methods to get out the maximum age for each row, store in a column "MAX_AGE"
  • Create bins at 10 year cut offs
  • Use pd.cut to assign the data into bins based on the max age of each row. Then use groupby on these bins and sum. Note that since we specify right=False, each bin includes its left edge and excludes its right edge, so the intervals [0, 10) and [10, 20) in the index mean 0-9 and 10-19.
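
If you would rather see '0-9' and '10-19' in the result than interval notation, pd.cut accepts a labels argument. A minimal sketch, assuming integer ages and the same bins as above; the names max_age, labels and age10 are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})

# Upper bound of each 5-year bin
max_age = df['AGE'].str.split('-').str[-1].astype(int)

bins = [0, 10, 20]
# One label per interval: [0, 10) -> '0-9', [10, 20) -> '10-19'
labels = [f'{lo}-{hi - 1}' for lo, hi in zip(bins[:-1], bins[1:])]

age10 = pd.cut(max_age, bins=bins, right=False, labels=labels).rename('AGE')

# observed=True only reports bins that actually contain rows
out = df.groupby(age10, observed=True)['VALUE'].sum().reset_index()
print(out)
```

To cover more ages, extend bins (for example with list(range(0, 101, 10))); the labels are rebuilt from it automatically.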

For reference, here is the data I was using:

import pandas as pd
    
VALUE = [10,20,30,40,]
AGE = ['0-4','5-9','10-14','15-19']

df = pd.DataFrame({'VALUE':VALUE,
                   'AGE':AGE})

Upvotes: 1

rhug123

Reputation: 8768

This should work as long as all the bins are in 5-year increments. It finds the rows where the upper bound is odd (9, 19, ...) and groups each of them with the row that came before, so every group ends at a row with an odd upper bound.

The code below splits the AGE string to get the numeric lower and upper bounds:

df['lower'] = df['AGE'].str.split('-').str[0]
df['upper'] = df['AGE'].str.split('-').str[1]
df[['lower','upper']] = df[['lower','upper']].astype(int)

Then apply the grouping logic and rebuild the AGE labels so they represent the 10-year ranges:

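# A row with an odd upper bound (9, 19, ...) closes a 10-year bin, so the row after it starts a new group
group = (df['upper'] % 2 == 1).shift(fill_value=False).cumsum()
df['VALUE'] = df.groupby(group)['VALUE'].transform('sum')
# Keep the last row of each group; keying on the group id (not VALUE) so bins with equal sums are not merged
df = df[~group.duplicated(keep='last')].copy()
df['lower'] = df['lower'] - 5
df[['lower','upper']] = df[['lower','upper']].astype(str)
df['AGE'] = df['lower'] + '-' + df['upper']
df = df.drop(columns = ['lower','upper'])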
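
To make the grouping key less opaque, here is a small trace of how it is built on the four-row sample. It is only an illustration; the intermediate names closes_bin, starts_new and group are not part of the answer's code:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})
upper = df['AGE'].str.split('-').str[1].astype(int)   # 4, 9, 14, 19

# True where a row is the second half of a 10-year bin
closes_bin = upper % 2 == 1                       # False, True, False, True

# The row after a closing row opens the next bin
starts_new = closes_bin.shift(fill_value=False)   # False, False, True, False

# A running count of "new bin" flags gives one id per 10-year bin
group = starts_new.cumsum()                       # 0, 0, 1, 1

totals = df.groupby(group)['VALUE'].sum()
print(totals)
```

The totals come out as 30 for group 0 and 70 for group 1, matching the 0-9 and 10-19 rows.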

Upvotes: 0
