Keelin

Reputation: 397

matching strings regex exact match - special characters

Following on from a solved thread here: matching strings regex exact match (with a big thank-you to @Onyambu for the updated code).

I need to match strings exactly - even if there are special characters.

Note - apologies, this is the third question on this issue. I am nearly there, but I don't know how to handle special characters and I am still upskilling on manipulating strings in R.

UPDATED FOR CLARITY:

I have a table of match words / strings like this:

codes <- structure(
  list(
    column1 = structure(
      c(2L, 3L, NA),
      .Label = c("",
                 "4+", "4 +"),
      class = "factor"
    ),
    column2 = structure(
      c(1L,
        3L, 2L),
      .Label = c("old", "the money", "work"),
      class = "factor"
    ),
    column3 = structure(
      c(3L, 2L, NA),
      .Label = c("", "wonderyears",
                 "woke"),
      class = "factor"
    )
  ),
  row.names = c(NA,-3L),
  class = "data.frame"
)

And a dataset that has a column of strings. I want to see if any of the codes are included in each of the records in strings:

strings<- structure(
  list(
    SurveyID = structure(
      1:4,
      .Label = c("ID_1", "ID_2",
                 "ID_3", "ID_4"),
      class = "factor"
    ),
    Open_comments = structure(
      c(2L,
        4L, 3L, 1L),
      .Label = c(
        "I need to pick up some apples",
        "The system works",
        "Flag only if there is a 4 with a plus",
        "Show me the money"
      ),
      class = "factor"
    )
  ),
  class = "data.frame",
  row.names = c(NA,-4L)
)

I am currently matching the codes to the strings using the following code:

strings[names(codes)] <- lapply(codes, function(x) 
  +(grepl(paste0("\\b", na.omit(x), "\\b", collapse = "|"), strings$Open_comments)))

Output:

  SurveyID                         Open_comments column1 column2 column3
1     ID_1                      The system works       0       0       0
2     ID_2                     Show me the money       0       1       0
3     ID_3 Flag only if there is a 4 with a plus       1       0       0
4     ID_4         I need to pick up some apples       0       0       0

Issue - Row 3 (ID_3): I only want to flag this row if the string includes "4+" or "4 +", but it is being flagged even though it contains neither. Is there any way to match the codes exactly?

Upvotes: 1

Views: 297

Answers (1)

akrun

Reputation: 887223

We can escape the + so that the regex treats it as a literal character rather than as a quantifier:

+(grepl(paste0( "(", gsub("\\+", "\\\\+", na.omit(codes$column1)), ")",
     collapse="|"), strings$Open_comments))
#[1] 0 0 0 0
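To make the reason for the original false positive concrete, here is a small sketch (the variable name `comment` is only for illustration). An unescaped `4+` means "one or more 4s", so it matches a bare 4; escaping the plus, or switching to `fixed = TRUE`, makes it literal.

```r
comment <- "Flag only if there is a 4 with a plus"

# Unescaped: "+" is a quantifier, so the pattern matches the bare "4"
unescaped <- grepl("\\b4+\\b", comment)

# What the unescaped pattern actually captured
captured <- regmatches(comment, regexpr("4+", comment))

# Escaped: "+" must appear literally after the 4
escaped <- grepl("4\\+", comment)

# fixed = TRUE treats the whole pattern as a literal string (no regex at all)
literal <- grepl("4+", comment, fixed = TRUE)

print(c(unescaped = unescaped, escaped = escaped, literal = literal))
print(captured)
```

Note that `fixed = TRUE` cannot be combined with `\\b` or `|`, so it only helps when each code is tested on its own.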

If we use a string that contains 4+, it is picked up:

+(grepl(paste0( "(", gsub("\\+", "\\\\+", na.omit(codes$column1)), ")",
     collapse="|"), "Flag only if there is a 4+ with a plus"))
#[1] 1
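The gsub() call above only escapes +. If the codes may contain other regex metacharacters (such as . or parentheses), a more general escape could look like the sketch below. The helper name `escape_regex` and the sample inputs are assumptions for illustration, not part of the original data.

```r
# Hypothetical helper: put a backslash in front of every regex metacharacter
escape_regex <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", as.character(x), perl = TRUE)
}

# Illustrative codes containing several special characters
sample_codes <- c("4+", "4 +", "a.b (c)")
escaped_codes <- escape_regex(sample_codes)
print(escaped_codes)

# Build one alternation and test it against some illustrative comments
pattern <- paste0("(", escaped_codes, ")", collapse = "|")
hits <- grepl(pattern, c("rated 4+ overall", "rated 4 overall", "see a.b (c)"))
print(+hits)
```

as.character() is applied first so the helper also accepts factor columns such as those in codes.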

And for multiple columns:

sapply(codes, function(x)+(grepl(paste0( "\\b(", 
      gsub("\\+", "\\\\+", na.omit(x)), ")\\b",
      collapse="|"), strings$Open_comments)))
#     column1 column2 column3
#[1,]       0       0       0
#[2,]       0       1       0
#[3,]       0       0       0
#[4,]       0       0       0
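One caveat with the `\\b` version, offered as a sketch rather than a definitive fix: `\\b` only matches between a word character and a non-word character. Because + is itself a non-word character, a trailing `\\b` after an escaped + fails when the code is followed by a space or the end of the string. A possible alternative, assuming `perl = TRUE` is acceptable, is to use lookarounds that only require that no word character sits directly next to the code. The test strings below are illustrative.

```r
texts <- c("Flag only if there is a 4 with a plus",
           "Flag only if there is a 4+ with a plus",
           "The system works")
codes_col1 <- c("4+", "4 +")

# Escape the plus sign, as in the answer above
escaped <- gsub("\\+", "\\\\+", codes_col1)

# Word-boundary version: the trailing \b cannot match between "+" and a space
pattern_b <- paste0("\\b(", escaped, ")\\b", collapse = "|")
with_b <- grepl(pattern_b, texts)

# Lookaround version: no word character immediately before or after the code
pattern_look <- paste0("(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
with_look <- grepl(pattern_look, texts, perl = TRUE)

print(rbind(with_b = +with_b, with_look = +with_look))
```

With these inputs the `\\b` pattern misses the second string even though it contains 4+, while the lookaround pattern flags it and still ignores the bare 4.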

Upvotes: 2
