Alexey Ferapontov

Reputation: 5169

Remove common elements in Data Frame

I am starting a separate thread because the question is now slightly different (see "R: split data frame rows by space, remove common elements, put unequal length columns in new df"). I have a data frame with an arbitrary number of columns and want to remove ALL elements that are not unique to a single column. The suggestion was to use intersect, but it only removes elements that are present in all columns (see below). I need to remove any element that appears in more than one column, and I need a vectorized solution: right now I can do it, but only tediously, by working with N separate vectors. Thanks!

This does the job, but only for elements that appear in every column:

df1 = structure(list(A = structure(1:3, .Label = c("R1", "R2", "R3"), class = "factor"), 
                    B = c("1 4 78 5 4 6 7 0", 
                          "2 3 76 8 2 1 8 0", 
                          "4 7 1 2"
                    )), .Names = c("A", "B"), row.names = c(NA, -3L), class = "data.frame")


s <- strsplit(df1$B, " ")
## find the intersection of all s
r <- Reduce(intersect, s)
## iterate over s, removing the intersection characters in r
l <- lapply(s, function(x) x[!x %in% r])
## reset the length of each vector in l to the length of the longest vector
## then create the new data frame
zz = setNames(as.data.frame(lapply(l, "length<-", max(sapply(l, length)))), letters[seq_along(l)])

Edit: My apologies, I should have included the desired output. Here it is:

Col1 Col2 Col3
78   3    NA
5    76   NA
6    8    NA
NA   8    NA

Upvotes: 1

Views: 1721

Answers (2)

mra68

Reputation: 2960

Perhaps

s <- strsplit(df1$B, " ")
n <- max(sapply(s,length))
M <- sapply(s,function(x){c(x,rep(Inf,n-length(x)))})
u <- unique(unlist(s))
r <- u[sapply(u,function(x){sum(rowSums(M==x)>0)>1})]

Then

> r
[1] "1" "4" "7" "2" "8"

are the elements that have to be removed. "Inf" is used to fill the gaps in the matrix "M" with something that doesn't appear in "df1$B" (it is coerced to the string "Inf" because the other entries are character). Each string in "df1$B" becomes a column of "M", so "M" is the transpose of "df1$B". That is why I used "rowSums" to check whether an element appears in a column of "df1$B". If the strings in "df1$B" are meant to be columns, replace "rowSums" with "colSums".
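
For the reading in which each string of "df1$B" is one column of the result (which is what the desired output in the question suggests), the "colSums" variant can be sketched as below. This is only a sketch: the vector B is copied from "df1$B" in the question so that the snippet runs on its own, and nothing else is changed apart from swapping "rowSums" for "colSums".

```r
# Strings copied from df1$B in the question; each string is one column of M
B <- c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2")
s <- strsplit(B, " ")
n <- max(sapply(s, length))
# Pad the shorter vectors with "Inf", a value that does not occur in B
M <- sapply(s, function(x) c(x, rep(Inf, n - length(x))))
u <- unique(unlist(s))
# Keep the elements of u that occur in more than one column of M
r <- u[sapply(u, function(x) sum(colSums(M == x) > 0) > 1)]
r
```

With the sample data, r then holds "1" "4" "7" "0" "2", i.e. the values that occur in more than one of the three strings.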

Upvotes: 0

Rorschach

Reputation: 32446

You can make a table of unique values from each list and remove those with counts greater than 1.

tab <- table(unlist(sapply(s, unique))) < 2
lapply(s, function(x) x[tab[x]])
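
To get from that list to a data frame with columns of equal length, the filtered vectors can be padded with NA, as in the question's own code. The following is a self-contained sketch of that idea, not a tested general solution: B is copied from "df1$B", the names counts, keep, kept, padded and out are only illustrative, and the filter uses %in% instead of indexing by the table, which should give the same result.

```r
# Strings copied from df1$B in the question
B <- c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2")
s <- strsplit(B, " ")
# Count in how many strings each value occurs (repeats within one string count once)
counts <- table(unlist(lapply(s, unique)))
# Values that occur in exactly one string
keep <- names(counts)[counts == 1]
kept <- lapply(s, function(x) x[x %in% keep])
# Pad every vector with NA up to the length of the longest one
padded <- lapply(kept, `length<-`, max(lengths(kept)))
out <- as.data.frame(setNames(padded, paste0("Col", seq_along(padded))),
                     stringsAsFactors = FALSE)
out
```

For the sample data this leaves 78, 5, 6 (plus one NA) in Col1, 3, 76, 8, 8 in Col2, and only NA in Col3, which matches the desired output in the question.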

Upvotes: 1
