Reputation: 11
I have a dataframe as:

    Name  Age
0    tom   10
1    tom    5
2   nick   15
3   juli   14
4    tom   20
5    tom   10
6    tom   10
7   juli   17
8    tom   30
9   nick   19
10  juli   24
11  juli   29
12   tom    0
13  juli   76
I'm computing a total count per 'Name', a count of consecutive same names, and the min and max of 'Age' within each consecutive group, to generate a dataframe as:

    Name  Age  min  max  count  cons
0    tom   10    5   10      7     2
1    tom    5    5   10      7     2
2   nick   15   15   15      2     1
3   juli   14   14   14      5     1
4    tom   20   10   20      7     3
5    tom   10   10   20      7     3
6    tom   10   10   20      7     3
7   juli   17   17   17      5     1
8    tom   30   30   30      7     1
9   nick   19   19   19      2     1
10  juli   24   24   29      5     2
11  juli   29   24   29      5     2
12   tom    0    0    0      7     1
13  juli   76   76   76      5     1
Then, I'm taking only the first row of every consecutive group, as:

   Name  Age  min  max  count  cons
0   tom   10    5   10      7     2
1  nick   15   15   15      2     1
2  juli   14   14   14      5     1
3   tom   20   10   20      7     3
4  juli   17   17   17      5     1
5   tom   30   30   30      7     1
6  nick   19   19   19      2     1
7  juli   24   24   29      5     2
8   tom    0    0    0      7     1
9  juli   76   76   76      5     1
Then, I'm trying to pull, from each consecutive group, the full row at which the max 'Age' within the bin 5-20 occurs, and to concat that dataframe with the first-values dataframe. But I got this output, with the bin rows shifted out of line with their groups:

   Name  Age  min  max  count  cons  Name   Age   min   max  count  cons
0   tom   10    5   10      7     2   tom  10.0   5.0  10.0    7.0   2.0
1  nick   15   15   15      2     1   NaN   NaN   NaN   NaN    NaN   NaN
2  juli   14   14   14      5     1  nick  15.0  15.0  15.0    2.0   1.0
3   tom   20   10   20      7     3  juli  14.0  14.0  14.0    5.0   1.0
4  juli   17   17   17      5     1   NaN   NaN   NaN   NaN    NaN   NaN
5   tom   30   30   30      7     1   tom  10.0  10.0  20.0    7.0   3.0
6  nick   19   19   19      2     1   NaN   NaN   NaN   NaN    NaN   NaN
7  juli   24   24   29      5     2  juli  17.0  17.0  17.0    5.0   1.0
8   tom    0    0    0      7     1   NaN   NaN   NaN   NaN    NaN   NaN
9  juli   76   76   76      5     1  nick  19.0  19.0  19.0    2.0   1.0
But the expected output has each consecutive group's bin row on the same line as that group's first-values row, with NaN in the bin columns for groups that have no 'Age' in 5-20.
Also, this is for a single bin, i.e. 5-20. How can I extend it to more than one bin? For example, if one bin is 5-20 and the next bin is 25-40, the expected output has one additional set of aligned columns per bin.
This is the code I have written for the above outputs:
import numpy as np
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['tom', 5], ['nick', 15], ['juli', 14], ['tom', 20],
        ['tom', 10], ['tom', 10], ['juli', 17], ['tom', 30], ['nick', 19],
        ['juli', 24], ['juli', 29], ['tom', 0], ['juli', 76]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
# print dataframe.
print("df = ",df)
print("")
# group key: consecutive runs of the same 'Name'
group = df['Name'].ne(df['Name'].shift()).cumsum()
# acquire min, max, total count and consecutive count
df['min'] = df.groupby(group)['Age'].transform('min')
df['max'] = df.groupby(group)['Age'].transform('max')
df['count'] = df.groupby('Name', sort=False)['Name'].transform('count')
df['cons'] = df.groupby(group)['Name'].transform('size')
print(df)
# take the first row of every consecutive group
temp_df = df.groupby(group, as_index=False)[df.columns].agg('first')
print("")
print("temp_df = ", temp_df)
# work on a copy with a clean 0..n-1 index
df_t = df.reset_index(drop=True)
print("df_t = ", df_t)
# for each consecutive group, get the index of the max 'Age' within the bin 5-20
df_t1 = df_t.groupby(group, as_index=False).apply(
    lambda x: x['Age'][(x['Age'] >= 5) & (x['Age'] < 20)].agg(lambda y: y.idxmax()))
print("")
print("df_t1 = ", df_t1)
# groups with no 'Age' in the bin yield an empty result instead of an
# integer index, so keep only the integer entries
b = [i for i in df_t1.tolist() if isinstance(i, np.integer)]
df_t1 = df_t.iloc[b]
print("")
print("output df_t1 = ", df_t1)
# concat the first-values df with the bin-max df
concatdf = pd.concat([temp_df, df_t1], axis=1)
print("")
print("concatdf = ", concatdf)
Thank you in advance :)
Upvotes: 1
Views: 183
Reputation: 260790
You can greatly simplify your code by doing a single groupby for almost all indicators except the cumulative count. Then just mask your data according to your criterion and concatenate. I believe this is doing what you want:
group = df['Name'].ne(df['Name'].shift()).cumsum()

df2 = (df
       .groupby(group, as_index=False)
       .agg(**{'Name': ('Name', 'first'),
               'Age': ('Age', 'first'),
               'min': ('Age', 'min'),
               'max': ('Age', 'max'),
               'cons': ('Age', 'count'),
               })
       .assign(count=lambda d: d.groupby('Name')['cons'].transform('sum'))
       )

out = pd.concat([df2, df2.where(df2['max'].between(5, 20))], axis=1)
output:
   Name  Age  min  max  cons  count  Name   Age   min   max  cons  count
0   tom   10    5   10     2      7   tom  10.0   5.0  10.0   2.0    7.0
1  nick   15   15   15     1      2  nick  15.0  15.0  15.0   1.0    2.0
2  juli   14   14   14     1      5  juli  14.0  14.0  14.0   1.0    5.0
3   tom   20   10   20     3      7   tom  20.0  10.0  20.0   3.0    7.0
4  juli   17   17   17     1      5  juli  17.0  17.0  17.0   1.0    5.0
5   tom   30   30   30     1      7   NaN   NaN   NaN   NaN   NaN    NaN
6  nick   19   19   19     1      2  nick  19.0  19.0  19.0   1.0    2.0
7  juli   24   24   29     2      5   NaN   NaN   NaN   NaN   NaN    NaN
8   tom    0    0    0     1      7   NaN   NaN   NaN   NaN   NaN    NaN
9  juli   76   76   76     1      5   NaN   NaN   NaN   NaN   NaN    NaN
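One detail worth noting: Series.between is inclusive of both bounds by default, whereas the question filtered with (Age >= 5) & (Age < 20). If you want half-open bins and are on pandas 1.3+, a sketch of the adjustment:

# 'inclusive' accepts 'both' (default), 'left', 'right' or 'neither'
# in pandas 1.3+; 'left' reproduces the question's [5, 20) filter
mask = df2['max'].between(5, 20, inclusive='left')
out = pd.concat([df2, df2.where(mask)], axis=1)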
For more bins:
bins = [(5, 20), (25, 40)]
out = pd.concat([df2] + [df2.where(df2['max'].between(a, b)) for a, b in bins],
                axis=1)
output:
   Name  Age  min  max  cons  count  Name   Age   min   max  cons  count  Name   Age   min   max  cons  count
0   tom   10    5   10     2      7   tom  10.0   5.0  10.0   2.0    7.0   NaN   NaN   NaN   NaN   NaN    NaN
1  nick   15   15   15     1      2  nick  15.0  15.0  15.0   1.0    2.0   NaN   NaN   NaN   NaN   NaN    NaN
2  juli   14   14   14     1      5  juli  14.0  14.0  14.0   1.0    5.0   NaN   NaN   NaN   NaN   NaN    NaN
3   tom   20   10   20     3      7   tom  20.0  10.0  20.0   3.0    7.0   NaN   NaN   NaN   NaN   NaN    NaN
4  juli   17   17   17     1      5  juli  17.0  17.0  17.0   1.0    5.0   NaN   NaN   NaN   NaN   NaN    NaN
5   tom   30   30   30     1      7   NaN   NaN   NaN   NaN   NaN    NaN   tom  30.0  30.0  30.0   1.0    7.0
6  nick   19   19   19     1      2  nick  19.0  19.0  19.0   1.0    2.0   NaN   NaN   NaN   NaN   NaN    NaN
7  juli   24   24   29     2      5   NaN   NaN   NaN   NaN   NaN    NaN  juli  24.0  24.0  29.0   2.0    5.0
8   tom    0    0    0     1      7   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN    NaN
9  juli   76   76   76     1      5   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN    NaN
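If the repeated column names are awkward downstream, one optional refinement (an assumption about your needs, not something the question asked for) is to suffix each bin's columns with its bounds using DataFrame.add_suffix:

# tag each bin's columns with its bounds, e.g. 'max_5_20', so the
# duplicated names stay distinguishable
out = pd.concat(
    [df2] + [df2.where(df2['max'].between(a, b)).add_suffix(f'_{a}_{b}')
             for a, b in bins],
    axis=1)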
Upvotes: 0