Reputation: 5169
Starting a separate thread because the question is slightly different now (R: split data frame rows by space, remove common elements, put unequal length columns in new df). I have a data frame with an arbitrary number of columns, and I want to remove ALL elements that are not unique to a single column. The suggestion was to use intersect,
but it only removes elements that are present in all columns (see below). I need to remove any element that appears in more than one column. I also need a vectorized solution: right now I can do it, but only by tediously working with N separate vectors.
Thanks!
This does the job, but only for elements that are seen in every column:
df1 = structure(list(A = structure(1:3, .Label = c("R1", "R2", "R3"), class = "factor"),
B = c("1 4 78 5 4 6 7 0",
"2 3 76 8 2 1 8 0",
"4 7 1 2"
)), .Names = c("A", "B"), row.names = c(NA, -3L), class = "data.frame")
s <- strsplit(df1$B, " ")
## find the intersection of all s
r <- Reduce(intersect, s)
## iterate over s, removing the intersection characters in r
l <- lapply(s, function(x) x[!x %in% r])
## reset the length of each vector in l to the length of the longest vector
## then create the new data frame
zz = setNames(as.data.frame(lapply(l, "length<-", max(sapply(l, length)))), letters[seq_along(l)])
Edit: My apologies, I should have included the desired output. Here it is:
Col1 Col2 Col3
78 3 NA
5 76 NA
6 8 NA
NA 8 NA
Upvotes: 1
Views: 1721
Reputation: 2960
Perhaps
s <- strsplit(df1$B, " ")
n <- max(sapply(s,length))
M <- sapply(s,function(x){c(x,rep(Inf,n-length(x)))})
u <- unique(unlist(s))
r <- u[sapply(u,function(x){sum(rowSums(M==x)>0)>1})]
Then
> r
[1] "1" "4" "7" "2" "8"
are the elements that have to be removed. "Inf" is used to pad the shorter columns of the matrix "M" with a value that doesn't appear in "df1$B". The matrix "M" is the transpose of "df1$B": each string in "df1$B" becomes one column of "M". I therefore used "rowSums" to check, for each row of "M" (that is, each column position in "df1$B"), whether the element appears there. If the strings in "df1$B" are meant to be the columns, replace "rowSums" with "colSums".
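For reference, here is a minimal sketch of the "colSums" variant mentioned above, assuming each string in "df1$B" is meant to be one column, as in the desired output of the question. The input vector "B" is copied from the question; the name "kept" is only illustrative.

```r
# Input strings copied from the question (df1$B)
B <- c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2")
s <- strsplit(B, " ")
n <- max(sapply(s, length))

# Pad every vector to length n; each string becomes one column of M
M <- sapply(s, function(x) c(x, rep(Inf, n - length(x))))
u <- unique(unlist(s))

# colSums(M == x) > 0 flags the columns (strings) that contain x;
# a value is removed when more than one column contains it
r <- u[sapply(u, function(x) sum(colSums(M == x) > 0) > 1)]
r
# [1] "1" "4" "7" "0" "2"

# Drop the shared values from each string (illustrative name)
kept <- lapply(s, function(x) x[!x %in% r])
```

With this variant "0" is removed (it occurs in the first and second string) and "8" is kept (it occurs twice, but only in the second string).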
Upvotes: 0
Reputation: 32446
You can make a table of unique values from each list and remove those with counts greater than 1.
tab <- table(unlist(sapply(s, unique))) < 2
lapply(s, function(x) x[tab[x]])
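To get from that list to the data frame shown in the question, one possible sketch is to pad the filtered vectors with NA, reusing the "length<-" idea from the question's own code. The names "counts", "kept", "padded" and "out" are illustrative assumptions, and the column names "Col1", "Col2", ... are taken from the desired output.

```r
# Input strings copied from the question (df1$B)
B <- c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2")
s <- strsplit(B, " ")

# Count in how many strings each value occurs
# (unique() makes repeats within one string count once)
counts <- table(unlist(lapply(s, unique)))

# Keep only the values that occur in exactly one string
kept <- lapply(s, function(x) x[as.vector(counts[x] < 2)])

# Pad every vector with NA up to the longest length
n <- max(lengths(kept))
padded <- lapply(kept, function(x) { length(x) <- n; x })
names(padded) <- paste0("Col", seq_along(padded))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
out
```

Values that repeat inside a single string, such as "8" in the second string, are kept twice, which matches the desired output.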
Upvotes: 1