Reputation: 153
I have a DataFrame containing data with age binned in separate rows, as below:
VALUE,AGE
10, 0-4
20, 5-9
30, 10-14
40, 15-19
.. .. .....
So, basically, the age is grouped in 5 year bins. I'd like to have 10 year bins, that is, 0-9,10-19 etc. What I'm after is the VALUE, but for 10-year based age bins, so the values would be:
VALUE,AGE
30, 0-9
70, 10-19
I can do it by shifting and adding, then taking every second row of the resulting DataFrame, but is there a smarter, more general way built into pandas to do this?
Upvotes: 0
Views: 1442
Reputation: 8790
Here's a "dumb" version, based on this answer - just sum every 2 rows:
In[0]
df.groupby(df.index // 2).sum()
Out[0]:
VALUE
0 30
1 70
I say "dumb" because this method doesn't factor in the age cutoffs; it just happens to align with them. If the age ranges vary in width, or if your data start at 5-9 instead of 0-4, it will group the wrong rows together. You also have to rename the index, since 0 and 1 don't say which age range each row covers.
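To make the starting-point caveat concrete, here is a minimal sketch using made-up data (df_shifted is hypothetical, not the question's data) that begins at 5-9. Pairing rows by position then merges 5-9 with 10-14, which crosses the 10-year boundary:

```python
import pandas as pd

# Hypothetical data that starts at 5-9 rather than 0-4
df_shifted = pd.DataFrame({'VALUE': [20, 30, 40],
                           'AGE': ['5-9', '10-14', '15-19']})

# Positional pairing: rows 0 and 1 are summed together, whatever their ages are
pairs = df_shifted.groupby(df_shifted.index // 2)['VALUE'].sum()
print(pairs)
```

Here group 0 holds 20 + 30 (ages 5-14) and group 1 holds only the 15-19 row, so neither group is a clean 10-year bin.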
A "smarter" version would be to actually create bins with pd.cut and use that to group the data, based on the ages for each row:
In[0]
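# Upper bound of each age range, e.g. '5-9' -> 9
df['MAX_AGE'] = df['AGE'].str.split('-').str[-1].astype(int)
# Bin edges at 10-year cutoffs
bins = [0,10,20]
# right=False gives bins [0, 10) and [10, 20); select VALUE so only that column is summed
out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False))[['VALUE']].sum()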
Out[0]:
VALUE
MAX_AGE
[0, 10) 30
[10, 20) 70
Explanation:
Use pandas.Series.str methods to get out the maximum age for each row, and store it in a column "MAX_AGE".
Create bins at 10 year cutoffs.
Use pd.cut to assign the data into bins based on the max age of each row. Then use groupby on these bins and sum. Note that since we specify right=False, the bins depicted in the index should mean 0-9 and 10-19.
For reference, here is the data I was using:
import pandas as pd
VALUE = [10,20,30,40,]
AGE = ['0-4','5-9','10-14','15-19']
df = pd.DataFrame({'VALUE':VALUE,
'AGE':AGE})
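If you'd rather have the index read 0-9 and 10-19 instead of interval notation, one option is to pass labels to pd.cut. The following is a minimal sketch, assuming integer ages and 10-year bins starting at 0; the names max_age, labels and binned are my own, and the bin edges are derived from the data rather than hard-coded:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})

# Upper bound of each age range, e.g. '5-9' -> 9
max_age = df['AGE'].str.split('-').str[-1].astype(int)

# Edges every 10 years, extending just past the largest upper bound
bins = list(range(0, int(max_age.max()) + 11, 10))

# Human-readable labels for the half-open bins [lo, hi): '0-9', '10-19', ...
labels = [f'{lo}-{hi - 1}' for lo, hi in zip(bins[:-1], bins[1:])]

binned = pd.cut(max_age, bins=bins, right=False, labels=labels)
out = df.groupby(binned, observed=False)['VALUE'].sum()
print(out)
```

Calling out.reset_index() afterwards turns the labels back into an ordinary AGE column next to VALUE.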
Upvotes: 1
Reputation: 8768
This should work as long as all the ranges are in 5 year increments. It finds each row whose upper number is odd (9, 19, ...) and groups it with the row that came before it, so every group ends at a row with an odd upper number.
The code below splits the AGE string to get the numeric lower and upper bounds:
df['lower'] = df['AGE'].str.split('-').str[0]
df['upper'] = df['AGE'].str.split('-').str[1]
df[['lower','upper']] = df[['lower','upper']].astype(int)
Then apply the grouping logic and rebuild the AGE labels so they represent the 10-year ranges:
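# A row with an odd upper bound (9, 19, ...) closes a 10-year bin; shift so the next row starts a new group
df['VALUE'] = df.groupby((df['upper'] % 2 == 1).shift(fill_value=False).cumsum())['VALUE'].transform('sum')
# Keep the last row of each pair (assumes no two bins end up with the same total)
df = df.drop_duplicates(subset = ['VALUE'],keep = 'last')
# The kept row is the upper half of the bin, so move its lower bound down by 5
df['lower'] = df['lower'] - 5
# Rebuild the AGE label, e.g. '0-9'
df[['lower','upper']] = df[['lower','upper']].astype(str)
df['AGE'] = df['lower'] + '-' + df['upper']
df = df.drop(columns = ['lower','upper'])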
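As an alternative sketch (not the shift/cumsum method above), the same result can be reached by rounding each lower bound down to its decade and grouping on that. This assumes every range lies entirely within one decade; the names lower, decade and result are my own:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [10, 20, 30, 40],
                   'AGE': ['0-4', '5-9', '10-14', '15-19']})

# Lower bound of each range, e.g. '15-19' -> 15
lower = df['AGE'].str.split('-').str[0].astype(int)

# Round down to the start of the decade: 0, 0, 10, 10
decade = lower // 10 * 10

# Sum VALUE per decade; the group key comes back as an integer AGE column
result = df.groupby(decade)['VALUE'].sum().reset_index()

# Turn the decade start into a label such as '10-19'
result['AGE'] = result['AGE'].astype(str) + '-' + (result['AGE'] + 9).astype(str)
print(result)
```

Because the grouping key is computed from the ages themselves, this does not depend on row order or on the data starting at 0-4.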
Upvotes: 0