msksantosh
msksantosh

Reputation: 399

Groupby and Sample pandas

I am trying to sample the resulting data after doing a groupby on multiple columns. If the respective groupby has more than 2 elements, I want to take sample 2 records, else take all the records

df:

col1   col2   col3   col4
A1     A2     A3     A4
A1     A2     A3     A5
A1     A2     A3     A6
B1     B2     B3     B4
B1     B2     B3     B5
C1     C2     C3     C4

target df:

col1   col2   col3   col4
A1     A2     A3     A4 or A5 or A6
A1     A2     A3     A4 or A5 or A6
B1     B2     B3     B4
B1     B2     B3     B5
C1     C2     C3     C4

I have mentioned A4 or A5 or A6 because, when we take sample, either of the three might return

This is what i have tried so far:

trial = pd.DataFrame(df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2)))

However, in this I do not get col1, col2 and col3

Upvotes: 1

Views: 1545

Answers (2)

jezrael
jezrael

Reputation: 862681

I think need double reset_index - first for remove 3.rd level of MultiIndex and second for convert MultiIndex to columns:

trial= (df.groupby(['col1', 'col2','col3'])['col4']
        .apply(lambda x: x if (len(x) <=2) else x.sample(2))
        .reset_index(level=3, drop=True)
        .reset_index())

Or reset_index with drop for remove column level_3:

trial= (df.groupby(['col1', 'col2','col3'])['col4']
        .apply(lambda x: x if (len(x) <=2) else x.sample(2))
        .reset_index()
        .drop('level_3', 1))

print (trial)
  col1 col2 col3 col4
0   A1   A2   A3   A4
1   A1   A2   A3   A6
2   B1   B2   B3   B4
3   B1   B2   B3   B5
4   C1   C2   C3   C4

Upvotes: 1

pratiklodha
pratiklodha

Reputation: 1115

There is no need to convert this to a pandas dataframe its one by default

trial=df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2))

And this should add the col1,2,3

trial.reset_index(inplace=True,drop=False)

Upvotes: 0

Related Questions