aju_k
aju_k

Reputation: 183

R delete last row in dataframe for each group

I would like to delete the last row in a dataframe for each group in R based on max(start_date).

Example data:

id      start_date  end_date
1       2016-01-14  2016-02-14
1       2016-03-14  2016-08-05
2       2014-01-14  2014-02-14
2       2015-03-21  2015-05-21
2       2015-08-23  2015-09-23
2       2015-11-21  2016-01-03

Result:

id      start_date  end_date
1       2016-01-14  2016-02-14
2       2014-01-14  2014-02-14
2       2015-03-21  2015-05-21
2       2015-08-23  2015-09-23

The following does not work:

df <- df %>% 
   group_by(id) %>% 
   summarise(start_date != max(start_date))

Error: found duplicated column name: id

df <- sqldf("select * from df group by id having start_date != max(start_date)")

error in statement: duplicate column name: id

Any suggestions would be great.

Upvotes: 2

Views: 7434

Answers (1)

akrun
akrun

Reputation: 887048

We can use slice (assuming that the dates are already ordered)

df1 %>% 
   group_by(id) %>% 
   slice(-n())
#     id start_date   end_date
#   <int>      <chr>      <chr>
#1     1 2016-01-14 2016-02-14
#2     2 2014-01-14 2014-02-14
#3     2 2015-03-21 2015-05-21
#4     2 2015-08-23 2015-09-23

If the dates are not ordered, then arrange and slice

df1 %>%
   group_by(id) %>%
   arrange(start_date) %>%
   slice(-n()) 

Based on some previous benchmarks (couldn't find the link), the arrange/slice method would be faster than comparing start_date != max(start_date)

Upvotes: 10

Related Questions