Unet Model

Reputation: 11

How to get the first value of every column for consecutive groups, and the max within different bins of each group, in a pandas DataFrame?

I have a dataframe as shown here: pandas dataframe

I'm grouping by consecutive runs of the 'Name' column to get the total count and the consecutive count, and applying min and max to the 'Age' column, which gives a dataframe like this: groupby & min, max, total count, cons count
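
(To make the grouping explicit, here is a minimal sketch of just the consecutive-group key, on the first few rows of the data; the full code is further below.)

import pandas as pd

# just the first few rows of the data used in the full code below
df = pd.DataFrame({'Name': ['tom', 'tom', 'nick', 'juli', 'tom'],
                   'Age':  [10, 5, 15, 14, 20]})

# a new group starts whenever 'Name' differs from the previous row
group = df['Name'].ne(df['Name'].shift()).cumsum()
print(group.tolist())  # [1, 1, 2, 3, 4]

# per-run aggregates, broadcast back onto every row of the run
df['min'] = df.groupby(group)['Age'].transform('min')
df['max'] = df.groupby(group)['Age'].transform('max')
df['cons'] = df.groupby(group)['Name'].transform('size')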

Then I'm taking only the first value of every column for each consecutive group: first values

Then I'm trying to get, from each consecutive group, the row holding the max 'Age' within the 5-20 bin, and to concat that dataframe with the dataframe of first values. But the output I get is: concat df
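
(A minimal sketch of the per-group check I'm attempting, continuing the small example above; grp and in_bin are throwaway names just for illustration.)

# within one consecutive run: index of the max 'Age' that falls in the 5-20 bin
grp = df[group == 1]
in_bin = grp['Age'][(grp['Age'] >= 5) & (grp['Age'] < 20)]
idx = in_bin.idxmax() if not in_bin.empty else None  # None when nothing falls in the bin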

But the expected output is:

expected output

Also, this is for a single bin (5-20). How can I extend it to more than one bin? For example, if one bin is 5-20 and the next is 25-40, the expected output is: more bins output

This is the code I have written for the above outputs:

import numpy as np
import pandas as pd

# initialize list of lists
data = [['tom', 10], ['tom', 5], ['nick', 15], ['juli', 14], ['tom', 20],['tom', 10], ['tom', 10], ['juli', 17], ['tom', 30], ['nick', 19], ['juli', 24], ['juli', 29],['tom', 0], ['juli', 76]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])


# print dataframe.
print("df = ",df)
print("")

# acquire min, max, count, consecutive same names
df['min'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Age'].transform('min')
df['max'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Age'].transform('max')
df['count'] = df.groupby("Name",sort=False)['Name'].transform('count')
df['cons'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Name'].transform('size')

print(df)

# take the first row's values of every consecutive group
df_t = df
temp_df=df.groupby(df['Name'].ne(df['Name'].shift()).cumsum(),as_index=False)[df.columns].agg('first')

print("")
print("temp_df = ",temp_df)

df_t = df_t.reset_index()
df_t = df_t.drop(['index'], axis=1)
print("df_t = ", df_t)

# index of the max 'Age' within the 5-20 bin for every consecutive group
df_t1 = df_t.groupby(df_t['Name'].ne(df_t['Name'].shift()).cumsum(), as_index=False).apply(lambda x: x['Age'][(x['Age'] >= 5) & (x['Age'] < 20)].agg(lambda y: y.idxmax()))
print("")
print("df_t1 = ", df_t1)

# keep only valid (integer) indices; groups with no value in the bin yield NaN
a = df_t1.tolist()
b = []
c = np.array([2]).astype('int64')
for i in a:
    if type(i) == type(c[0]):
        b.append(i)

df_t1 = df_t.iloc[b]
print("")
print("output df_t1 = ", df_t1)

# concat the bin max and first value df
concatdf = pd.concat([temp_df, df_t1],axis=1)
print("")
print("concatdf = ", concatdf)

Thank you in advance :)

Upvotes: 1

Views: 183

Answers (1)

mozway

Reputation: 260790

You can greatly simplify your code by using a single groupby for almost all indicators, except the total count.

Then just mask your data according to your criterion and concatenate.

I believe this is doing what you want:

group = df['Name'].ne(df['Name'].shift()).cumsum()

df2 = (df
   .groupby(group, as_index=False)
   .agg(**{'Name': ('Name', 'first'),
           'Age': ('Age', 'first'),
           'min': ('Age', 'min'),
           'max': ('Age', 'max'),
           'cons': ('Age', 'count')
            })
   .assign(count=lambda d: d.groupby('Name')['cons'].transform('sum'))
)

out = pd.concat([df2, df2.where(df2['max'].between(5,20))], axis=1)

output:

   Name  Age  min  max  cons  count  Name   Age   min   max  cons  count
0   tom   10    5   10     2      7   tom  10.0   5.0  10.0   2.0    7.0
1  nick   15   15   15     1      2  nick  15.0  15.0  15.0   1.0    2.0
2  juli   14   14   14     1      5  juli  14.0  14.0  14.0   1.0    5.0
3   tom   20   10   20     3      7   tom  20.0  10.0  20.0   3.0    7.0
4  juli   17   17   17     1      5  juli  17.0  17.0  17.0   1.0    5.0
5   tom   30   30   30     1      7   NaN   NaN   NaN   NaN   NaN    NaN
6  nick   19   19   19     1      2  nick  19.0  19.0  19.0   1.0    2.0
7  juli   24   24   29     2      5   NaN   NaN   NaN   NaN   NaN    NaN
8   tom    0    0    0     1      7   NaN   NaN   NaN   NaN   NaN    NaN
9  juli   76   76   76     1      5   NaN   NaN   NaN   NaN   NaN    NaN
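
Note that where keeps the shape and index of df2 and just masks the non-matching rows with NaN, which is what keeps the two halves row-aligned in the concat. A tiny illustration of that behaviour:

s = pd.Series([10, 30, 15])
print(s.where(s.between(5, 20)))
# 0    10.0
# 1     NaN
# 2    15.0
# dtype: float64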

For more bins:

bins = [(5,20), (25,40)]
out = pd.concat([df2]+[df2.where(df2['max'].between(a,b)) for a,b in bins], axis=1)

output:

   Name  Age  min  max  cons  count  Name   Age   min   max  cons  count  Name   Age   min   max  cons  count
0   tom   10    5   10     2      7   tom  10.0   5.0  10.0   2.0    7.0   NaN   NaN   NaN   NaN   NaN    NaN
1  nick   15   15   15     1      2  nick  15.0  15.0  15.0   1.0    2.0   NaN   NaN   NaN   NaN   NaN    NaN
2  juli   14   14   14     1      5  juli  14.0  14.0  14.0   1.0    5.0   NaN   NaN   NaN   NaN   NaN    NaN
3   tom   20   10   20     3      7   tom  20.0  10.0  20.0   3.0    7.0   NaN   NaN   NaN   NaN   NaN    NaN
4  juli   17   17   17     1      5  juli  17.0  17.0  17.0   1.0    5.0   NaN   NaN   NaN   NaN   NaN    NaN
5   tom   30   30   30     1      7   NaN   NaN   NaN   NaN   NaN    NaN   tom  30.0  30.0  30.0   1.0    7.0
6  nick   19   19   19     1      2  nick  19.0  19.0  19.0   1.0    2.0   NaN   NaN   NaN   NaN   NaN    NaN
7  juli   24   24   29     2      5   NaN   NaN   NaN   NaN   NaN    NaN  juli  24.0  24.0  29.0   2.0    5.0
8   tom    0    0    0     1      7   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN    NaN
9  juli   76   76   76     1      5   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN    NaN
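
If you want to tell the bin blocks apart, one option (just a sketch building on the code above, with arbitrary labels) is to pass keys to pd.concat so each block gets a top-level column label:

bins = [(5, 20), (25, 40)]
out = pd.concat(
    [df2] + [df2.where(df2['max'].between(a, b)) for a, b in bins],
    axis=1,
    keys=['all'] + [f'{a}-{b}' for a, b in bins],  # top-level label for each block of columns
)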

Upvotes: 0
