Florestan
Florestan

Reputation: 127

Check & remove variable duplicates within one observation

I'm facing the following data.table that turns out to have duplicate variables but just for a single observation.

Consider the minimum example:

DT = data.table(Firm = c("Firm1", "Firm2", "Firm3", "Firm4"), Customer1=c("Alice", "Bob", "Alice", "Bob"), ID1=c("1", "2", "1", "2"), Customer2=c("Charly", "Sarah", "Alicia", "Jack"), ID2=c("3", "4", "1", "5"), Customer3=c("Kevin", "Sara", "Deborah", "NA"), ID3=c("6", "4", "7", "NA"))

Firm    Customer1    ID1    Customer2    ID2    Customer3    ID3

Firm1     Alice       1       Charly      3       Kevin       6
Firm2       Bob       2       Sarah       4       Sara        4
Firm3     Alice       1       Alicia      1      Deborah      7
Firm4       Bob       2       Jack        5         NA        NA

Basically, the issue in the data is that the same customer might be considered twice because the name is misspelt (consider Alice/Alicia and Sara/Sarah in the minimum example) but the ID tells me, that it's the same customer. So I want to clean the data to delete Customer + ID, in case the ID has already shown up within the same observation. The final DT would ideally look like

Firm    Customer1    ID1    Customer2    ID2    Customer3    ID3

Firm1     Alice       1       Charly      3       Kevin       6
Firm2       Bob       2       Sarah       4       NA          NA
Firm3     Alice       1       NA          NA      Deborah     7
Firm4       Bob       2       Jack        5       NA          NA

or even better:

Firm    Customer1    ID1    Customer2    ID2    Customer3    ID3

Firm1     Alice       1       Charly      3       Kevin       6
Firm2       Bob       2       Sarah       4       NA          NA
Firm3     Alice       1       Deborah     7       NA          NA
Firm4       Bob       2       Jack        5       NA          NA

The dataset is quite large, so if possible I would like to avoid to have to loop through each row and compare several combinations of Customer and IDs. Does somebody have an idea of an efficient solution that I am not aware of?

Upvotes: 1

Views: 178

Answers (1)

Jaap
Jaap

Reputation: 83275

In the case you describe, I would melt the data to long format and then remove duplicates by Firm and ID with the unique-function, add a new rowid for each Firm and finally reshape it to wide format again with dcast.

Using:

DT.l <- melt(DT, id = 1, measure.vars = list(c(2,4,6), c(3,5,7)),
             value.name = c('Customer','ID'))

DT.w <- dcast(unique(DT.l, by = c('Firm','ID'))[, variable := rowid(Firm)],
              Firm ~ variable, value.var = c('Customer','ID'))

setcolorder(DT.w, c(1:2,5,3,6,4,7))

gives:

> DT.w
    Firm Customer_1 ID_1 Customer_2 ID_2 Customer_3 ID_3
1: Firm1      Alice    1     Charly    3      Kevin    6
2: Firm2        Bob    2      Sarah    4         NA   NA
3: Firm3      Alice    1    Deborah    7         NA   NA
4: Firm4        Bob    2       Jack    5         NA   NA

Notes:

  • Instead of specifying the column positions in measure.vars, you can also match by pattern in the column names: measure.vars = patterns('Customer',"ID").
  • I used setcolorder to set the columns in the order as showed in the desired output, but this is off course not necessarily needed.

Upvotes: 3

Related Questions