Olivia

Reputation: 814

R subset vector when treated as strings

I have a large data frame where I forced my vectors into strings (using lapply and toString) so they would fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?

X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))

 X
      y     z
1   ABC   ABC
2    A   A,B,C


all(X$y %in% X$z)
[1] FALSE

(X$y[1] %in% X$z[1])
[1] TRUE

(X$y[2] %in% X$z[2])
[1] FALSE

I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.

In the above case, A is a subset of A,B,C. However, because I've treated both as strings, the check doesn't work.

In the example above, each y holds a single value, while z holds one value in row 1 and three values in row 2. The data frame I'll be testing has 10,000 rows; y will have 1-5 values per row and z 1-100 per row. It looks like the y values are always a subset of z, but I'd like to check.

Upvotes: 0

Views: 95

Answers (2)

Benjamin

Reputation: 17279

I think it might work better not to use your lapply and toString combination, and instead to store the lists directly in your data frame. For this purpose, I find the tbl_df class (from the tibble package) friendlier, although I believe data.table objects can do this as well (someone correct me if I'm wrong).

library(tibble)

y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))

X <- tibble(y = y_char,
            z = z_char)

Notice that when you print X now, the entry in each row of the tibble is an element of the list. Now we can use mapply to do a pairwise comparison.

# All y in z
mapply(function(x, y) all(x %in% y),
       X$y,
       X$z)


# All z in y
mapply(function(x, y) all(y %in% x),
       X$y,
       X$z)
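If the data frame already holds the collapsed strings, the list columns can probably be rebuilt with strsplit before applying the same mapply comparison. The following is only a minimal sketch in base R: the column names y_list and z_list are made up for illustration, and the separator pattern assumes the strings came from toString(), which joins elements with ", ".

```r
# Sample data in the collapsed-string form; "A, B, C" mimics toString() output
X <- data.frame(y = c("ABC", "A"),
                z = c("ABC", "A, B, C"),
                stringsAsFactors = FALSE)

# Split on a comma followed by optional whitespace; each row becomes a character vector
# (y_list and z_list are hypothetical column names)
X$y_list <- strsplit(X$y, ",\\s*")
X$z_list <- strsplit(X$z, ",\\s*")

# Row-wise check: is every element of y present in z?
is_subset <- mapply(function(a, b) all(a %in% b), X$y_list, X$z_list)

# Single answer for the whole data frame
all(is_subset)
```

Here is_subset has one logical value per row, so all(is_subset) answers the "is y always a subset of z" question in one go.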

Upvotes: 1

joel.wilson

Reputation: 8413

df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))

apply(df, 1, function(x) {               # operate row by row
  y = unlist(strsplit(x[1], ","))        # split the y value on "," in case it holds several elements
  z = unlist(strsplit(x[2], ","))        # split the z value the same way
  all(y %in% z)                          # TRUE only if every element of y is present in z
})

# [1] TRUE TRUE

# Case 2: with different data. You will have to decide whether you want a perfect subset or not; the code can be updated accordingly.
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
# Rerunning the apply() call above on this df gives:
# [1]  TRUE FALSE
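For 10,000 rows it may also help to see which rows fail and which elements are missing, not just TRUE/FALSE. Below is a rough sketch of one way to do that with strsplit on whole columns plus setdiff. The helper split_tokens is a hypothetical name introduced here, and the pattern tolerates spaces around the commas on the assumption that toString() produced the strings.

```r
# Hypothetical helper: split a comma-separated string column into a list of tokens,
# ignoring whitespace around commas and at the ends of the string
split_tokens <- function(x) strsplit(trimws(as.character(x)), "\\s*,\\s*")

df <- data.frame(y = c("ABC", "A, B, D"),
                 z = c("ABC", "A, B, C"),
                 stringsAsFactors = FALSE)

y_tok <- split_tokens(df$y)
z_tok <- split_tokens(df$z)

# For each row, the elements of y that do not appear in z
missing <- Map(setdiff, y_tok, z_tok)

# A row passes when nothing is missing
ok <- lengths(missing) == 0

which(!ok)        # row numbers where y is not a subset of z
missing[!ok]      # the offending elements for those rows
```

Because the splitting is done once per column instead of once per row inside apply, this avoids converting the data frame to a character matrix, though whether that matters at this size is untested.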

Upvotes: 1
