Reputation: 127
I'm facing the following data.table that turns out to have duplicate variables but just for a single observation.
Consider the minimum example:
DT = data.table(Firm = c("Firm1", "Firm2", "Firm3", "Firm4"), Customer1=c("Alice", "Bob", "Alice", "Bob"), ID1=c("1", "2", "1", "2"), Customer2=c("Charly", "Sarah", "Alicia", "Jack"), ID2=c("3", "4", "1", "5"), Customer3=c("Kevin", "Sara", "Deborah", "NA"), ID3=c("6", "4", "7", "NA"))
Firm Customer1 ID1 Customer2 ID2 Customer3 ID3
Firm1 Alice 1 Charly 3 Kevin 6
Firm2 Bob 2 Sarah 4 Sara 4
Firm3 Alice 1 Alicia 1 Deborah 7
Firm4 Bob 2 Jack 5 NA NA
Basically, the issue in the data is that the same customer might be considered twice because the name is misspelt (consider Alice/Alicia and Sara/Sarah in the minimum example) but the ID tells me, that it's the same customer. So I want to clean the data to delete Customer + ID, in case the ID has already shown up within the same observation. The final DT would ideally look like
Firm Customer1 ID1 Customer2 ID2 Customer3 ID3
Firm1 Alice 1 Charly 3 Kevin 6
Firm2 Bob 2 Sarah 4 NA NA
Firm3 Alice 1 NA NA Deborah 7
Firm4 Bob 2 Jack 5 NA NA
or even better:
Firm Customer1 ID1 Customer2 ID2 Customer3 ID3
Firm1 Alice 1 Charly 3 Kevin 6
Firm2 Bob 2 Sarah 4 NA NA
Firm3 Alice 1 Deborah 7 NA NA
Firm4 Bob 2 Jack 5 NA NA
The dataset is quite large, so if possible I would like to avoid to have to loop through each row and compare several combinations of Customer and IDs. Does somebody have an idea of an efficient solution that I am not aware of?
Upvotes: 1
Views: 178
Reputation: 83275
In the case you describe, I would melt
the data to long format and then remove duplicates by Firm
and ID
with the unique
-function, add a new rowid
for each Firm
and finally reshape it to wide format again with dcast
.
Using:
DT.l <- melt(DT, id = 1, measure.vars = list(c(2,4,6), c(3,5,7)),
value.name = c('Customer','ID'))
DT.w <- dcast(unique(DT.l, by = c('Firm','ID'))[, variable := rowid(Firm)],
Firm ~ variable, value.var = c('Customer','ID'))
setcolorder(DT.w, c(1:2,5,3,6,4,7))
gives:
> DT.w Firm Customer_1 ID_1 Customer_2 ID_2 Customer_3 ID_3 1: Firm1 Alice 1 Charly 3 Kevin 6 2: Firm2 Bob 2 Sarah 4 NA NA 3: Firm3 Alice 1 Deborah 7 NA NA 4: Firm4 Bob 2 Jack 5 NA NA
Notes:
measure.vars
, you can also match by pattern in the column names: measure.vars = patterns('Customer',"ID")
.setcolorder
to set the columns in the order as showed in the desired output, but this is off course not necessarily needed.Upvotes: 3