Reputation: 967
I have huge a dataframe with millions of rows and id. My data looks like this:
Time ID X Y
8:00 A 23 100
9:00 B 24 110
10:00 B 25 120
11:00 C 26 130
12:00 C 27 140
13:00 A 28 150
14:00 A 29 160
15:00 D 30 170
16:00 C 31 180
17:00 B 32 190
18:00 A 33 200
19:00 C 34 210
20:00 A 35 220
21:00 B 36 230
22:00 C 37 240
23:00 B 38 250
When I sort the data on id and time, the result looks like this:
Time ID X Y
8:00 A 23 100
13:00 A 28 150
14:00 A 29 160
18:00 A 33 200
20:00 A 35 220
9:00 B 24 110
10:00 B 25 120
17:00 B 32 190
21:00 B 36 230
23:00 B 38 250
11:00 C 26 130
12:00 C 27 140
16:00 C 31 180
19:00 C 34 210
22:00 C 37 240
15:00 D 30 170
From here, I want to pick only "The first and the last" of the id and eliminate the rest. The expected result looks like this:
Time ID X Y
8:00 A 23 100
20:00 A 35 220
9:00 B 24 110
23:00 B 38 250
11:00 C 26 130
22:00 C 37 240
15:00 D 30 170
How to do it in pandas?
Upvotes: 32
Views: 34205
Reputation: 253
I had the same question but found the following answer on as well
Ensure your Time is actual Time and is correctly sorted
df['Time'] = pd.to_datetime(df['Time'],format="mixed")
Sort on ID/Time
df.sort_values(["ID","Time"],inplace=True)
use the nth function to get the required rows
df.groupby('ID').nth([0,-1])
Output
Time ID X Y
0 2025-01-11 08:00:00 A 23 100
12 2025-01-11 20:00:00 A 35 220
1 2025-01-11 09:00:00 B 24 110
15 2025-01-11 23:00:00 B 38 250
3 2025-01-11 11:00:00 C 26 130
14 2025-01-11 22:00:00 C 37 240
7 2025-01-11 15:00:00 D 30 170
Upvotes: 0
Reputation: 732
You can get the columns you want by simply sorting the 'ID' column. by:
df_sorted = df.sort_values("ID")
After that make an empty data frame of the same columns by searching blank:
all = df[df.ID=='']
Store all the unique values present in the 'ID' column by:
uni = list(df.ID.unique())
Then finally, append the first and last head and tail in that empty data frame that you created earlier.
Final code will look something like this:
df_sorted = df.sort_values("ID")
all = df[df.ID=='']
uni = list(df.ID.unique())
for x in uni:
all = all.append(df[df.ID==x].head(1), ignore_index=True)
all = all.append(df[df.ID==x].tail(1), ignore_index=True)
all
Upvotes: 0
Reputation: 374
If you create a small function to only select the first and last rows of a DataFrame, you can apply this to a group-by, like so:
df.groupby('ID').apply(lambda x: x.iloc[[0, -1]]).reset_index(drop=True)
As others have mentioned, it might be nice to also .drop_duplicates()
or similar after the fact, to filter out duplicated rows for cases where there was only one row for the 'ID'.
Upvotes: 4
Reputation: 402393
Use groupby
, find the head
and tail
for each group, and concat
the two.
g = df.groupby('ID')
(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))
Time ID X Y
0 8:00 A 23 100
1 20:00 A 35 220
2 9:00 B 24 110
3 23:00 B 38 250
4 11:00 C 26 130
5 22:00 C 37 240
6 15:00 D 30 170
If you can guarantee each ID group has at least two rows, the drop_duplicates
call is not needed.
Details
g.head(1)
Time ID X Y
0 8:00 A 23 100
1 9:00 B 24 110
3 11:00 C 26 130
7 15:00 D 30 170
g.tail(1)
Time ID X Y
7 15:00 D 30 170
12 20:00 A 35 220
14 22:00 C 37 240
15 23:00 B 38 250
pd.concat([g.head(1), g.tail(1)])
Time ID X Y
0 8:00 A 23 100
1 9:00 B 24 110
3 11:00 C 26 130
7 15:00 D 30 170
7 15:00 D 30 170
12 20:00 A 35 220
14 22:00 C 37 240
15 23:00 B 38 250
Upvotes: 47