Reputation: 967

Get only the first and last rows of each group with pandas

I have huge a dataframe with millions of rows and id. My data looks like this:

Time    ID  X   Y
8:00    A   23  100
9:00    B   24  110
10:00   B   25  120
11:00   C   26  130
12:00   C   27  140
13:00   A   28  150
14:00   A   29  160
15:00   D   30  170
16:00   C   31  180
17:00   B   32  190
18:00   A   33  200
19:00   C   34  210
20:00   A   35  220
21:00   B   36  230
22:00   C   37  240
23:00   B   38  250

When I sort the data on id and time, the result looks like this:

Time    ID  X   Y
8:00    A   23  100
13:00   A   28  150
14:00   A   29  160
18:00   A   33  200
20:00   A   35  220
9:00    B   24  110
10:00   B   25  120
17:00   B   32  190
21:00   B   36  230
23:00   B   38  250
11:00   C   26  130
12:00   C   27  140
16:00   C   31  180
19:00   C   34  210
22:00   C   37  240
15:00   D   30  170

From here, I want to pick only "The first and the last" of the id and eliminate the rest. The expected result looks like this:

Time    ID  X   Y
8:00    A   23  100
20:00   A   35  220
9:00    B   24  110
23:00   B   38  250
11:00   C   26  130
22:00   C   37  240
15:00   D   30  170

How to do it in pandas?

Upvotes: 32

Answers (4)

Vinay

Reputation: 253

I had the same question but found the following answer on as well

Ensure your Time is actual Time and is correctly sorted

df['Time'] = pd.to_datetime(df['Time'],format="mixed")
Sort on ID/Time

df.sort_values(["ID","Time"],inplace=True)
use the nth function to get the required rows

df.groupby('ID').nth([0,-1])

Output

                  Time ID   X    Y
0  2025-01-11 08:00:00  A  23  100
12 2025-01-11 20:00:00  A  35  220
1  2025-01-11 09:00:00  B  24  110
15 2025-01-11 23:00:00  B  38  250
3  2025-01-11 11:00:00  C  26  130
14 2025-01-11 22:00:00  C  37  240
7  2025-01-11 15:00:00  D  30  170

Upvotes: 0

Akash Kumar

Reputation: 742

You can get the columns you want by simply sorting the 'ID' column. by:

df_sorted = df.sort_values("ID")

After that make an empty data frame of the same columns by searching blank:

all = df[df.ID=='']

Store all the unique values present in the 'ID' column by:

uni = list(df.ID.unique())

Then finally, append the first and last head and tail in that empty data frame that you created earlier.

Final code will look something like this:

df_sorted = df.sort_values("ID")
all = df[df.ID=='']
uni = list(df.ID.unique())
for x in uni:
  all = all.append(df[df.ID==x].head(1), ignore_index=True)
  all = all.append(df[df.ID==x].tail(1), ignore_index=True)
all

Upvotes: 0

johnnybarrels

Reputation: 374

If you create a small function to only select the first and last rows of a DataFrame, you can apply this to a group-by, like so:

df.groupby('ID').apply(lambda x: x.iloc[[0, -1]]).reset_index(drop=True)

As others have mentioned, it might be nice to also .drop_duplicates() or similar after the fact, to filter out duplicated rows for cases where there was only one row for the 'ID'.

Upvotes: 4

cs95

Reputation: 403128

Use groupby, find the head and tail for each group, and concat the two.

g = df.groupby('ID')

(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))

    Time ID   X    Y
0   8:00  A  23  100
1  20:00  A  35  220
2   9:00  B  24  110
3  23:00  B  38  250
4  11:00  C  26  130
5  22:00  C  37  240
6  15:00  D  30  170

If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.

Details

g.head(1)

    Time ID   X    Y
0   8:00  A  23  100
1   9:00  B  24  110
3  11:00  C  26  130
7  15:00  D  30  170

g.tail(1)

     Time ID   X    Y
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

pd.concat([g.head(1), g.tail(1)])

     Time ID   X    Y
0    8:00  A  23  100
1    9:00  B  24  110
3   11:00  C  26  130
7   15:00  D  30  170
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

Upvotes: 47

Get only the first and last rows of each group with pandas

Answers (4)

Related Questions