Shlomi Schwartz

Reputation: 8903

Pandas - create a new DataFrame from first n groups of a groupby operation

Given the following DataFrame:

   A   B
0  1  11
1  2  22
2  2  22
3  3  33
4  3  33

I would like to group by 'A', take the first n groups, and create a new DataFrame from them. I've looked around and found this answer:

result = [g[1] for g in list(grouped)[:3]]

But that solution returns a list rather than a DataFrame. It also seems redundant to build a list from the grouped result.

Update: The expected output is a new DataFrame made up of the first n groups. For example, if n=2 the output would be:

   A   B
0  1  11 <-- first group
1  2  22 <-- second group
2  2  22 <-- second group

Any help would be appreciated

Upvotes: 2

Views: 2101

Answers (3)

9769953

Reputation: 12201

Technically, you can't: the groups aren't necessarily in the same order as the rows of your dataframe. By default, the grouped result is sorted by the group-by column (this can be turned off with sort=False), and that sorting defines the order of the groups. In other words, the individual groups should be accessed using the values from the grouped column (A here).

In your case, this may work:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [11, 22, 22, 33, 33]})
grouped = df.groupby('A')
n = 2
df = pd.concat([group for name, group in grouped][:n])
print(df)

which yields

   A   B
0  1  11
1  2  22
2  2  22

But if the input dataframe is the following (note the order of values in the columns):

import pandas as pd

df = pd.DataFrame({'A': [2, 2, 3, 3, 1], 'B': [22, 22, 33, 33, 11]})
grouped = df.groupby('A')
n = 2
df = pd.concat([group for name, group in grouped][:n])
print(df)

the first two groups, concatenated, will still be

   A   B
4  1  11
0  2  22
1  2  22

because the groups are sorted by values in column 'A'. (Note how the values are as before; the index, however, is different.)

So there is no real "first n elements" for a set of groupby results.
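
To make the two possible meanings of "first n groups" explicit, here is a minimal sketch that uses GroupBy.ngroup() to build a row mask instead of concatenating groups. It assumes the reordered frame from the second example; the names by_key and by_appearance are only illustrative.

```python
import pandas as pd

# Same reordered frame as in the second example above.
df = pd.DataFrame({'A': [2, 2, 3, 3, 1], 'B': [22, 22, 33, 33, 11]})
n = 2

# ngroup() labels every row with the position of its group in iteration order.
# Default sort=True: groups are ordered by key value (1, 2, 3).
by_key = df[df.groupby('A').ngroup() < n]

# sort=False: groups are ordered by first appearance in the frame (2, 3, 1).
by_appearance = df[df.groupby('A', sort=False).ngroup() < n]

print(by_key)
print(by_appearance)
```

Because this is a boolean mask, the selected rows keep their original row order in both cases; only the choice of which groups count as "first" changes.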

Upvotes: 2

Ch3steR

Reputation: 20669

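We can use pd.factorize here together with isin. Since the grouping key is column 'A', factorize that column; the second element of the result holds the distinct keys in order of first appearance.

ids = pd.factorize(df['A'])[1]  # distinct keys, in order of appearance
n = 2 # Take first two groups
m = df['A'].isin(ids[:n])       # True for rows belonging to the first n keys
df.loc[m]

   A   B
0  1  11
1  2  22
2  2  22

Output when n=1

ids = pd.factorize(df['A'])[1]
n = 1 # Take first group
m = df['A'].isin(ids[:n])
df.loc[m]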

   A   B
0  1  11
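
The same idea can be wrapped in a small function. This is only a sketch: first_n_groups is a hypothetical helper name, not a pandas function, and the key column is passed in explicitly. It is run here on a reordered frame to show that the groups are taken in order of first appearance rather than sorted by key.

```python
import pandas as pd


def first_n_groups(df, key, n):
    """Hypothetical helper: keep rows whose `key` value is among the
    first n distinct values, in order of first appearance."""
    uniques = pd.factorize(df[key])[1]  # distinct values, appearance order
    return df.loc[df[key].isin(uniques[:n])]


# Reordered input: key 1 appears last.
df = pd.DataFrame({'A': [2, 2, 3, 3, 1], 'B': [22, 22, 33, 33, 11]})
result = first_n_groups(df, 'A', 2)
print(result)
```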

Upvotes: 1

sammywemmy

Reputation: 28644

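You could collect the row indices of the first n groups and create a new dataframe from them:

from functools import reduce

grouped = df.groupby('A')
n = 2

# keys of the first n groups
keys = [*grouped.groups][:n]

# merge the per-group row labels into a single index;
# reduce handles any n >= 1, whereas pd.Index.union(*...) only works for exactly two groups
indices = reduce(pd.Index.union, [grouped.groups[key] for key in keys])

indices
Int64Index([0, 1, 2], dtype='int64')

df.loc[indices]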


   A   B
0  1  11
1  2  22
2  2  22

Note also that groupby sorts the groups by key by default (sort=True), which is useful if you want the data in a particular order; with sort=False, the first n groups are taken in the order in which they appear in the dataframe.
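
As a rough illustration of the sort=False behaviour, the sketch below iterates over the groupby object and stops after n groups with itertools.islice, so the remaining groups are never collected. The reordered sample frame and the variable names are assumptions made for the example.

```python
from itertools import islice

import pandas as pd

# Reordered input: key 1 appears last.
df = pd.DataFrame({'A': [2, 2, 3, 3, 1], 'B': [22, 22, 33, 33, 11]})
n = 2

# sort=False makes iteration follow the order in which keys first appear.
# islice stops after n groups instead of materialising all of them.
first_groups = [group for _, group in islice(df.groupby('A', sort=False), n)]
result = pd.concat(first_groups)
print(result)
```

With the default sort=True, the same code would instead start with the group for key 1.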

Upvotes: 0
