remove duplicate row values across columns in a large data.frame

Question

I have a data.frame like this:

ID  code1   code2  code3
A    143     143    144
A    35      453     35
A             35     15
B    46      46      45
B    12      43     765
C    255     455     344
C    343     343     343
C    343     23      23

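I want each code to appear only once within a row, so that values repeated in the same row are left blank.

The ID may be repeated, and the real data.frame is very large. The desired output is: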

ID  code1   code2  code3
A    143             144
A    35      453     
A             35     15
B    46              45
B    12      43      765
C    255     455     344
C    343          
C    343     23

thanks

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

This solution is likely to be somewhat inefficient, mostly because it reshapes the data from wide to long and then back to wide. However, you might find it easier to work with your data in a "long" form anyway.

First, generate a second ID, since you have IDs spanning multiple rows.

## seq_along(ID) keeps ID2 an integer counter, even if ID is a factor or character
mydf$ID2 <- with(mydf, ave(seq_along(ID), ID, FUN = seq_along))
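
The answer never shows how `mydf` is built, so here is a minimal sketch that reconstructs the sample data from the question and shows what the second ID looks like. It assumes the blank cell in the question is `NA` and that `ID` is stored as character; `seq_along(ID)` is passed as the first argument of `ave` so that the counter comes back as an integer.

```r
# Sample data from the question; the blank cell in row 3 is assumed to be NA
mydf <- data.frame(
  ID    = c("A", "A", "A", "B", "B", "C", "C", "C"),
  code1 = c(143, 35, NA, 46, 12, 255, 343, 343),
  code2 = c(143, 453, 35, 46, 43, 455, 343, 23),
  code3 = c(144, 35, 15, 45, 765, 344, 343, 23),
  stringsAsFactors = FALSE
)

# Running counter within each ID, so that ID + ID2 identifies one row
mydf$ID2 <- with(mydf, ave(seq_along(ID), ID, FUN = seq_along))
mydf$ID2
# [1] 1 2 3 1 2 1 2 3
```

Together, `ID` and `ID2` act as a row key, which is what lets the later steps find duplicates within a row rather than across the whole `ID`.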

Second, use melt from the "reshape2" package to make your data into a long form.

library(reshape2)
m.df <- melt(mydf, id.vars=c("ID", "ID2"))

With the data in its long form, it is much easier to identify duplicates and replace them with NA.

m.df[duplicated(m.df[setdiff(names(m.df), "variable")]), "value"] <- NA
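
To see why that one-liner works, here is a small sketch on hand-made long data (the `long` data.frame below is hypothetical and only covers the first two rows of ID "A"). `duplicated()` on a data.frame flags a row when the same combination of values has already appeared further up, so dropping the "variable" column from the comparison flags any value repeated for the same `ID` and `ID2`.

```r
# Hypothetical long-form data for the first two rows of ID "A"
long <- data.frame(
  ID       = "A",
  ID2      = c(1, 2, 1, 2, 1, 2),
  variable = rep(c("code1", "code2", "code3"), each = 2),
  value    = c(143, 35, 143, 453, 144, 35)
)

# Compare rows on every column except "variable": a row is flagged when the
# same (ID, ID2, value) combination has already been seen further up
key_cols <- setdiff(names(long), "variable")
is_dup <- duplicated(long[key_cols])
is_dup
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

# Blank out only the repeated values; the first occurrence is kept
long[is_dup, "value"] <- NA
```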

If you are happy with your data in the long form, stop there. If you want it back in its wide form, use dcast (again from "reshape2").

dcast(m.df, ID + ID2 ~ variable)
#   ID ID2 code1 code2 code3
# 1  A   1   143    NA   144
# 2  A   2    35   453    NA
# 3  A   3    NA    35    15
# 4  B   1    46    NA    45
# 5  B   2    12    43   765
# 6  C   1   255   455   344
# 7  C   2   343    NA    NA
# 8  C   3   343    23    NA

For reference, this can also be done in base R. The syntax is clumsier, though it might perform better than the "reshape2" equivalents.

mydf$ID2 <- with(mydf, ave(seq_along(ID), ID, FUN = seq_along))
m.df <- cbind(mydf[c("ID", "ID2")], 
              stack(mydf[setdiff(names(mydf), c("ID", "ID2"))]))
m.df[duplicated(m.df[setdiff(names(m.df), "ind")]), "values"] <- NA
cbind(mydf[c("ID", "ID2")], unstack(m.df, values ~ ind))
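
If you need to run this more than once, the base R steps can be wrapped in a function. The sketch below is a variation, not the code above verbatim: `blank_row_dups` is a hypothetical name, and it uses a plain row index as the key instead of `ID` + `ID2`, which gives the same per-row comparison. It assumes every column other than `id` holds codes of the same type.

```r
# Hypothetical helper: blank out values repeated within a row (base R only)
blank_row_dups <- function(df, id = "ID") {
  code_cols <- setdiff(names(df), id)
  # stack() appends the columns one after another, so repeat the row index
  # once per code column to keep it aligned with the stacked values
  long <- cbind(
    row_id = rep(seq_len(nrow(df)), times = length(code_cols)),
    stack(df[code_cols])
  )
  # A value is a duplicate if the same (row_id, values) pair was already
  # seen in an earlier column
  long[duplicated(long[c("row_id", "values")]), "values"] <- NA
  cbind(df[id], unstack(long, values ~ ind))
}

# Usage on the three "A" rows from the question (blank cell assumed to be NA)
mydf_a <- data.frame(
  ID    = c("A", "A", "A"),
  code1 = c(143, 35, NA),
  code2 = c(143, 453, 35),
  code3 = c(144, 35, 15),
  stringsAsFactors = FALSE
)
res <- blank_row_dups(mydf_a)
res
```

In `res`, the second 143 in row 1 and the second 35 in row 2 become `NA`, while the 35 in row 3 is kept because it is the only 35 in that row.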

Update: A possible `data.table` solution

You may want to explore data.table since you mention that your data are large. Here's one possible solution (though @Arun might have a more direct solution to share).

library(data.table)
DT <- data.table(mydf)

## Creates your long data.table. The rep() calls are explicit so that
## ID2 and Variable line up with the column-by-column order of unlist(.SD)
temp <- DT[, list(ID2 = rep(seq_len(.N), times = length(.SD)),
                  Variable = rep(names(.SD), each = .N),
                  Value = unlist(.SD, use.names = FALSE)), by = "ID"]
## Changes duplicates within a row to NA
temp[duplicated(temp, by = c("ID", "ID2", "Value")), Value := NA]
## "Reshapes" the data from long to wide
temp[, as.list(setNames(Value, Variable)), by = list(ID, ID2)]
#    ID ID2 code1 code2 code3
# 1:  A   1   143    NA   144
# 2:  A   2    35   453    NA
# 3:  A   3    NA    35    15
# 4:  B   1    46    NA    45
# 5:  B   2    12    43   765
# 6:  C   1   255   455   344
# 7:  C   2   343    NA    NA
# 8:  C   3   343    23    NA
