user63230
user63230

Reputation: 4708

remove first occurrence of duplicate column names data.table

What is the cleanest way to remove columns with duplicate names with the caveat that I want to keep the second occurrence (or another way, remove the first occurrence)?

Given:

library(data.table)
dt <- structure(list(CERT_NUMBER = c(999, NA, NA), FORENAME = c("JOHN", 
                                                                NA, NA), SURNAME = c("JOHNSON", NA, NA), START_DATE = structure(c(16801L, 
                                                                                                                                  NA, NA), class = c("IDate", "Date")), EXPIRY_DATE = structure(c(17166L, 
                                                                                                                                                                                                  NA, NA), class = c("IDate", "Date")), ID = c(1, 2, 3), FORENAME = c("JOHN", 
                                                                                                                                                                                                                                                                      "JACK", "ROB"), SURNAME = c("JOHNSON", "JACKSON", "ROBINSON"), 
                     MONTH = structure(c(16953L, 16953L, 16953L), class = c("IDate", 
                                                                            "Date"))), row.names = c(NA, -3L), class = c("data.table", 
                                                                                                                         "data.frame"))
dt
#    CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999     JOHN JOHNSON 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA     <NA>    <NA>       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA     <NA>    <NA>       <NA>        <NA>  3      ROB ROBINSON 2016-06-01

I want to keep the second occurrence of duplicate column names to be left with:

#    CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
# 1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
# 2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
# 3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01

If we didn't care about the order of the duplicates, we could do the following which keeps the first duplicates which is not what I want:

dt[, .SD, .SDcols = unique(names(dt))]
#    CERT_NUMBER FORENAME SURNAME START_DATE EXPIRY_DATE ID      MONTH
# 1:         999     JOHN JOHNSON 2016-01-01  2016-12-31  1 2016-06-01
# 2:          NA     <NA>    <NA>       <NA>        <NA>  2 2016-06-01
# 3:          NA     <NA>    <NA>       <NA>        <NA>  3 2016-06-01

thanks

Upvotes: 2

Views: 528

Answers (2)

mt1022
mt1022

Reputation: 17299

Here is a more flexible way:

g <- as.integer(ave(names(dt), names(dt), FUN = length))

# for duplicated column names, keep the 1st occurrence
dt[, g == 1 | (rowid(names(dt)) == 1), with = FALSE]

# keep the 2nd occurrence
dt[, g == 1 | (rowid(names(dt)) == 2), with = FALSE]

# keep the 2nd and 3rd occurrences
dt[, g == 1 | (rowid(names(dt)) %in% c(2, 3)), with = FALSE]

# keep the last occurrence
dt[, g == rowid(names(dt)), with = FALSE]

Upvotes: 2

RobertoT
RobertoT

Reputation: 1683

If the repeated columns are only repeated 2 times you can try duplicated() with fromlast=TRUE argument:

dt[, .SD, .SDcols = ! duplicated(colnames(dt),fromLast=TRUE)]

   CERT_NUMBER START_DATE EXPIRY_DATE ID FORENAME  SURNAME      MONTH
1:         999 2016-01-01  2016-12-31  1     JOHN  JOHNSON 2016-06-01
2:          NA       <NA>        <NA>  2     JACK  JACKSON 2016-06-01
3:          NA       <NA>        <NA>  3      ROB ROBINSON 2016-06-01

Upvotes: 3

Related Questions