Reputation: 1304

Deleting duplicated rows based on condition (position)

I have a dataset that looks something like this

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1970,1970,1971,1980,1980,1981,1982), 
                 "Val" = c(2,3,-2,5,2,5,3,5))

I have mulple observations for each id and time identifier - e.g. I have 3 different alpha 1970 values. I would like to retain only one observation per id/year most notably the last one that appears in for each id/year. the final dataset should look something like this:

final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"), 
                    "Year" = c(1970,1971,1980,1981,1982), 
                    "Val" = c(-2,5,5,3,5))

Does anyone know how I can approach the problem?

Thanks a lot in advance for your help

Upvotes: 1

Answers (3)

DHW

Reputation: 1267

How about this?

library(tidyverse)

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1970,1970,1971,1980,1980,1981,1982), 
                 "Val" = c(2,3,-2,5,2,5,3,5))

final <- 
  df %>% 
  group_by(id, Year) %>% 
  slice(n()) %>% 
  ungroup()

final
#> # A tibble: 5 x 3
#>   id     Year   Val
#>   <fct> <dbl> <dbl>
#> 1 Alpha  1970    -2
#> 2 Alpha  1971     5
#> 3 Beta   1980     5
#> 4 Beta   1981     3
#> 5 Beta   1982     5

^{Created on 2019-09-29 by the reprex package (v0.3.0)}

Translates to "within each id-Year group, take only the row where the row number is equal to the size of the group, i.e. it's the last row under the current ordering."

You could also use either filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but distinct functions take the first distinct row, so you'd need to reverse the row ordering here first.

Upvotes: 1

akrun

Reputation: 887203

An option with base R

aggregate(Val ~ ., df, tail, 1)
#     id Year Val
#1 Alpha 1970  -2
#2 Alpha 1971   5
#3  Beta 1980   5
#4  Beta 1981   3
#5  Beta 1982   5

If we need to select the first row

aggregate(Val ~ ., df, head, 1)

Upvotes: 1

Joris C.

Reputation: 6234

If you are open to a data.table solution, this can be done quite concisely:

library(data.table)

setDT(df)[, .SD[.N], by = c("id", "Year")]
#>       id Year Val
#> 1: Alpha 1970  -2
#> 2: Alpha 1971   5
#> 3:  Beta 1980   5
#> 4:  Beta 1981   3
#> 5:  Beta 1982   5

by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.

Upvotes: 2

Deleting duplicated rows based on condition (position)

Answers (3)

Related Questions