Reputation: 4708
What is the cleanest way to remove columns with duplicate names with the caveat that I want to keep the second occurrence (or another way, remove the first occurrence)?
Given:
library(data.table)
dt <- structure(list(CERT_NUMBER = c(999, NA, NA), FORENAME = c("JOHN",
NA, NA), SURNAME = c("JOHNSON", NA, NA), START_DATE = structure(c(16801L,
NA, NA), class = c("IDate", "Date")), EXPIRY_DATE = structure(c(17166L,
NA, NA), class = c("IDate", "Date")), ID = c(1, 2, 3), FORENAME = c("JOHN",
"JACK", "ROB"), SURNAME = c("JOHNSON", "JACKSON", "ROBINSON"),
MONTH = structure(c(16953L, 16953L, 16953L), class = c("IDate",
"Date"))), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
dt
# CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID FORENAME SURNAME MONTH
# 1: 999 JOHN JOHNSON 2016-01-01 2016-12-31 1 JOHN JOHNSON 2016-06-01
# 2: NA <NA> <NA> <NA> <NA> 2 JACK JACKSON 2016-06-01
# 3: NA <NA> <NA> <NA> <NA> 3 ROB ROBINSON 2016-06-01
I want to keep the second occurrence of duplicate column names to be left with:
# CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME SURNAME MONTH
# 1: 999 2016-01-01 2016-12-31 1 JOHN JOHNSON 2016-06-01
# 2: NA <NA> <NA> 2 JACK JACKSON 2016-06-01
# 3: NA <NA> <NA> 3 ROB ROBINSON 2016-06-01
If we didn't care about the order of the duplicates, we could do the following which keeps the first duplicates which is not what I want:
dt[, .SD, .SDcols = unique(names(dt))]
# CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID MONTH
# 1: 999 JOHN JOHNSON 2016-01-01 2016-12-31 1 2016-06-01
# 2: NA <NA> <NA> <NA> <NA> 2 2016-06-01
# 3: NA <NA> <NA> <NA> <NA> 3 2016-06-01
thanks
Upvotes: 2
Views: 528
Reputation: 17299
Here is a more flexible way:
g <- as.integer(ave(names(dt), names(dt), FUN = length))
# for duplicated column names, keep the 1st occurrence
dt[, g == 1 | (rowid(names(dt)) == 1), with = FALSE]
# keep the 2nd occurrence
dt[, g == 1 | (rowid(names(dt)) == 2), with = FALSE]
# keep the 2nd and 3rd occurrences
dt[, g == 1 | (rowid(names(dt)) %in% c(2, 3)), with = FALSE]
# keep the last occurrence
dt[, g == rowid(names(dt)), with = FALSE]
Upvotes: 2
Reputation: 1683
If the repeated columns are only repeated 2 times you can try duplicated()
with fromlast=TRUE
argument:
dt[, .SD, .SDcols = ! duplicated(colnames(dt),fromLast=TRUE)]
CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME SURNAME MONTH
1: 999 2016-01-01 2016-12-31 1 JOHN JOHNSON 2016-06-01
2: NA <NA> <NA> 2 JACK JACKSON 2016-06-01
3: NA <NA> <NA> 3 ROB ROBINSON 2016-06-01
Upvotes: 3