How to get only rows with ID where the date is earlier than another row with the same ID?

Question

I have a data frame that looks something like this.

License.Number, DateFormatted
A019, 2018-09-20
A019, 2018-09-21
A020, 2018-09-21

I want to remove rows with duplicate license numbers, keeping only the row with the earliest DateFormatted value for each license number.

How do I do this in R?

Maurits Evers · Accepted Answer

A tidyverse option

library(tidyverse)
df %>%
    mutate(DateFormatted = as.Date(DateFormatted)) %>%   # compare as dates, not text
    arrange(License.Number, DateFormatted) %>%           # earliest date first per license
    group_by(License.Number) %>%
    filter(row_number() == 1)                            # keep the first row of each group
## A tibble: 2 x 2
## Groups:   License.Number [2]
#  License.Number DateFormatted
#            
#1 A019           2018-09-20
#2 A020           2018-09-21

Or in base R using duplicated

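df$DateFormatted <- as.Date(df$DateFormatted)
# Assign the sorted data frame back; otherwise the ordering is discarded
df <- df[order(df$License.Number, df$DateFormatted), ]
# duplicated() flags every repeat after the first occurrence of a license number
df[!duplicated(df$License.Number), ]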
#  License.Number DateFormatted
#1           A019    2018-09-20
#3           A020    2018-09-21

In both cases we ensure that DateFormatted is a Date object, sort rows by License.Number and DateFormatted (from earliest to latest), and then keep only the first entry per License.Number.
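
To illustrate why the sort step matters, here is a minimal base R sketch. It assumes a hypothetical data frame called shuffled (not part of the original post) that holds the same rows as the sample data, but with the later A019 date listed first.

```r
# Hypothetical input: same rows as the sample data, but in a different order
shuffled <- data.frame(
  License.Number = c("A019", "A020", "A019"),
  DateFormatted  = as.Date(c("2018-09-21", "2018-09-21", "2018-09-20"))
)

# Without sorting, duplicated() keeps whichever row happens to come first
unsorted_result <- shuffled[!duplicated(shuffled$License.Number), ]

# Sorting by license number and then date puts the earliest date first per license
sorted <- shuffled[order(shuffled$License.Number, shuffled$DateFormatted), ]
sorted_result <- sorted[!duplicated(sorted$License.Number), ]
```

Here unsorted_result keeps the 2018-09-21 row for A019, while sorted_result keeps the 2018-09-20 row, which is the one the question asks for.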

Sample data

df <- read.table(text =
    "License.Number DateFormatted
A019 2018-09-20
A019 2018-09-21
A020 2018-09-21", header = TRUE)
