Username
Username

Reputation: 3663

How to get only rows with ID where the date is earlier than another row with the same ID?

I have a data frame that looks something like this.

License.Number, DateFormatted
A019, 2018-09-20
A019, 2018-09-21
A020, 2018-09-21

I want to remove rows with duplicate license numbers, but keep only the ones with the earliest DateFormatted data.

How do I do this in R?

Upvotes: 2

Views: 39

Answers (2)

nghauran
nghauran

Reputation: 6768

Here are two options (assuming DateFormatted is a Date class):

## option 1: with base R
df[!duplicated(df[, "DateFormatted"], fromLast = FALSE), ]
# output
  License.Number DateFormatted
1           A019    2018-09-20
2           A019    2018-09-21
## option 2: with dplyr
library(dplyr)
df %>%
  group_by(License.Number) %>%
  slice(which.min(DateFormatted))
# output
# A tibble: 2 x 2
# Groups:   License.Number [2]
  License.Number DateFormatted
           <chr>        <date>
1           A019    2018-09-20
2           A020    2018-09-21

Data

df <- structure(list(License.Number = c("A019", "A019", "A020"), DateFormatted = structure(c(17794, 
17795, 17795), class = "Date")), .Names = c("License.Number", 
"DateFormatted"), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 2

Maurits Evers
Maurits Evers

Reputation: 50678

A tidyverse option

library(tidyverse)
df %>%
    mutate(DateFormatted = as.Date(DateFormatted)) %>%
    arrange(License.Number, DateFormatted) %>%
    group_by(License.Number) %>%
    filter(row_number(License.Number) == 1)
## A tibble: 2 x 2
## Groups:   License.Number [2]
#  License.Number DateFormatted
#  <fct>          <date>
#1 A019           2018-09-20
#2 A020           2018-09-21

Or in base using duplicated

df$DateFormatted <- as.Date(df$DateFormatted)
df[order(df$License.Number, df$DateFormatted), ]
df[!duplicated(df$License.Number), ]
#  License.Number DateFormatted
#1           A019    2018-09-20
#3           A020    2018-09-21

In both cases we ensure that DateFormatted is a Date object, sort rows by License.Number and DateFormatted (from earliest to latest) and then keep only the first entry per Licence.Number.


Sample data

df <- read.table(text =
    "License.Number DateFormatted
A019 2018-09-20
A019 2018-09-21
A020 2018-09-21", header = T)

Upvotes: 2

Related Questions