R subset vector when treated as strings

Question

I have a large data frame in which I've collapsed my vectors into strings (using lapply and toString) so that they fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?

X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))

 X
      y     z
1   ABC   ABC
2    A   A,B,C


all(X$y %in% X$z)
[1] FALSE

(X$y[1] %in% X$z[1])
[1] TRUE

(X$y[2] %in% X$z[2])
[1] FALSE

I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.

In the above case, A is a subset of A,B,C. However, because I've treated both as strings, the check doesn't work.

In the example above, y holds a single value in each row, while z holds one value in row 1 and three in row 2. The real data frame I'll be testing has 10,000 rows, with 1-5 values per row in y and 1-100 values per row in z. It looks like the y values are always a subset of z, but I'd like to check.

joel.wilson · Accepted Answer

df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))

apply(df, 1, function(x) {            # operate row by row
  y = unlist(strsplit(x[1], ","))     # split the y string on "," into its items
  z = unlist(strsplit(x[2], ","))     # split the z string on "," into its items
  all(y %in% z)                       # TRUE only if every item of y is present in z
})

# [1] TRUE TRUE

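# Case 2: different data. Row 2 of y now contains "D", which is not in z.
# You will have to decide whether every item of y must appear in z or whether
# a partial match is enough; the code can be updated accordingly.
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
# Re-running the apply() call above on this data gives:
# [1]  TRUE FALSE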
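
For 10,000 rows it may also help to see which items are missing, not just TRUE/FALSE. Below is a rough sketch of the same idea using strsplit() on whole columns plus mapply(). It is not part of the original answer. The helper name split_items and the result names is_subset and missing are made up for illustration. One assumption worth flagging: toString() joins with ", " (comma plus space) by default, so the sketch trims whitespace after splitting. The sample data in the question has no spaces, in which case trimws() simply does nothing.

```r
df <- data.frame(y = c("ABC", "A,B,D"), z = c("ABC", "A,B,C"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split each comma-separated string back into a
# character vector and trim stray spaces (toString() uses ", " by default).
split_items <- function(s) {
  lapply(strsplit(as.character(s), ",", fixed = TRUE), trimws)
}

y_items <- split_items(df$y)   # list with one character vector per row
z_items <- split_items(df$z)

# TRUE for a row only if every item of y appears among that row's z items.
is_subset <- unname(mapply(function(y, z) all(y %in% z), y_items, z_items))

# Per row, the y items that were not found in z (empty when y is a subset).
missing <- mapply(setdiff, y_items, z_items, SIMPLIFY = FALSE)

is_subset
missing
```

all(is_subset) then answers the "is y always a subset of z" question for the whole data frame, and which(!is_subset) gives the rows to inspect.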
