Rez99

Reputation: 389

How do I use grep on a data frame?

I have the following data frame:

> my.data
  A.Seats    B.Seats
1   14,15   14,15,16
2       7        7,8
3   12,13      16,17
4    <NA>      10,11

I would like to check if the string within any row in column "A.Seats" is found within the same row of column "B.Seats". So the output would look something like this:

  A.Seats    B.Seats    Check
1   14,15   14,15,16     TRUE
2       7        7,8     TRUE
3   12,13      16,17    FALSE
4    <NA>      10,11    FALSE

But I don't know how to create this table. As a start, I tried using grep:

grep(my.data$A.Seats,my.data$B.Seats)

But I receive the following output:

[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used

...and I can't get past this warning. Any ideas as to how I can get the intended result?

Many Thanks

Upvotes: 3

Views: 10429

Answers (2)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:

my.data <- data.frame(
    A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
    B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
#   A.Seats  B.Seats
# 1   14,15 14,15,16
# 2       7      7,8
# 3   12,13    16,17
# 4    <NA>    10,11
# 5   14,19 14,15,16

library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1]  TRUE  TRUE FALSE    NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1]  TRUE  TRUE FALSE    NA  TRUE

The first option above treats each value in my.data$A.Seats as a fixed string that must appear verbatim in the corresponding B.Seats value. The second option turns each value into a regular expression that matches any one of its comma-separated seats.
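
One caveat with the alternation pattern is that it matches substrings, so a seat such as "7" is also found inside "17". The following is a minimal base R sketch of one way to guard against that with word boundaries. The vectors a and b and the variable names are made up for illustration and are not the asker's data; perl = TRUE is an assumption chosen so that \b is handled by the PCRE engine.

```r
# Hypothetical seat strings, chosen so that "7" can collide with "17"
a <- c("14,15", "7", "14,19")
b <- c("14,15,16", "16,17", "14,15,16")

# Plain alternation: "7" becomes the regex 7, which also matches inside "17"
loose <- gsub(",", "|", a)

# Same alternation wrapped in word boundaries, so only whole seat numbers match
strict <- paste0("\\b(", gsub(",", "|", a), ")\\b")

# Apply each pattern to the B value in the same position
loose_hit  <- mapply(grepl, loose,  b, MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)
strict_hit <- mapply(grepl, strict, b, MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

loose_hit
strict_hit
```

Here the second element differs: the loose pattern reports a match for "7" against "16,17", while the bounded pattern does not. The same bounded pattern could presumably be passed to the regex argument of stri_detect() as well.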

Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
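
As a small sketch of that last step, here are two common base R ways to turn the NA into FALSE. The vector check is simply the fixed-pattern result copied from the output above, and the other variable names are made up for illustration.

```r
# Result of the fixed-pattern stri_detect() call, copied from the output above
check <- c(TRUE, TRUE, FALSE, NA, FALSE)

# Option 1: overwrite the missing entries on a copy
check_filled <- check
check_filled[is.na(check_filled)] <- FALSE

# Option 2: %in% never returns NA, so NA is mapped to FALSE in one step
check_in <- check %in% TRUE

check_filled
check_in
```

Either result can then be assigned to a new column, for example my.data$Check.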


If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:

vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats)                 # pattern used as-is (grepl still reads it as a regex)
# [1]  1  1  0 NA  0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15     7 12|13  <NA> 14|19 
#     1     1     0    NA     1 
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats))     # coerce to logical
# [1]  TRUE  TRUE FALSE    NA FALSE

Because this calls grepl on each element in the vector, I don't think this will scale well though.

Upvotes: 2

Jilber Urbina

Reputation: 61154

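This is one approach to get what you need: split both columns on the commas, then check whether any seat from A.Seats appears among the seats in B.Seats.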

> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
  A.Seats  B.Seats Check
1   14,15 14,15,16  TRUE
2       7      7,8  TRUE
3   12,13    16,17 FALSE
4    <NA>    10,11 FALSE
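
The same idea can be unrolled step by step, which also makes it easy to switch between "any seat shared" and "all seats shared". This is only a sketch: the fifth row is borrowed from the other answer to show where the two checks differ, stringsAsFactors = FALSE is assumed so the columns are plain character vectors, and the names a_split, b_split, any_shared and all_shared are made up for illustration.

```r
# The asker's data plus one extra row ("14,19") borrowed from the other answer
my.data <- data.frame(
  A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
  B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"),
  stringsAsFactors = FALSE
)

# Split every cell into a character vector of individual seats
a_split <- strsplit(my.data$A.Seats, ",")
b_split <- strsplit(my.data$B.Seats, ",")

# Row by row: is at least one A seat present in B?
any_shared <- mapply(function(a, b) any(a %in% b), a_split, b_split)

# Row by row: are all A seats present in B?
all_shared <- mapply(function(a, b) all(a %in% b), a_split, b_split)

transform(my.data, Any = any_shared, All = all_shared)
```

Because %in% compares whole seat strings rather than substrings, "7" is not confused with "17", and a missing A.Seats value comes out as FALSE rather than NA.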

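Here's an alternative using grep. Because grep uses only the first element of a multi-element pattern (hence the suppressWarnings), only the first seat of each A.Seats entry is actually tested: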

> transform(my.data, 
            Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))

Upvotes: 1
