Reputation: 3663
I have a data frame that looks something like this.
License.Number, DateFormatted
A019, 2018-09-20
A019, 2018-09-21
A020, 2018-09-21
I want to remove rows with duplicate license numbers, but keep only the ones with the earliest DateFormatted
data.
How do I do this in R?
Upvotes: 2
Views: 39
Reputation: 6768
Here are two options (assuming DateFormatted
is a Date class):
## option 1: with base R
df[!duplicated(df[, "DateFormatted"], fromLast = FALSE), ]
# output
License.Number DateFormatted
1 A019 2018-09-20
2 A019 2018-09-21
## option 2: with dplyr
library(dplyr)
df %>%
group_by(License.Number) %>%
slice(which.min(DateFormatted))
# output
# A tibble: 2 x 2
# Groups: License.Number [2]
License.Number DateFormatted
<chr> <date>
1 A019 2018-09-20
2 A020 2018-09-21
Data
df <- structure(list(License.Number = c("A019", "A019", "A020"), DateFormatted = structure(c(17794,
17795, 17795), class = "Date")), .Names = c("License.Number",
"DateFormatted"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 2
Reputation: 50678
A tidyverse
option
library(tidyverse)
df %>%
mutate(DateFormatted = as.Date(DateFormatted)) %>%
arrange(License.Number, DateFormatted) %>%
group_by(License.Number) %>%
filter(row_number(License.Number) == 1)
## A tibble: 2 x 2
## Groups: License.Number [2]
# License.Number DateFormatted
# <fct> <date>
#1 A019 2018-09-20
#2 A020 2018-09-21
Or in base using duplicated
df$DateFormatted <- as.Date(df$DateFormatted)
df[order(df$License.Number, df$DateFormatted), ]
df[!duplicated(df$License.Number), ]
# License.Number DateFormatted
#1 A019 2018-09-20
#3 A020 2018-09-21
In both cases we ensure that DateFormatted
is a Date
object, sort rows by License.Number
and DateFormatted
(from earliest to latest) and then keep only the first entry per Licence.Number
.
df <- read.table(text =
"License.Number DateFormatted
A019 2018-09-20
A019 2018-09-21
A020 2018-09-21", header = T)
Upvotes: 2