Reputation: 397
Following on from a solved thread here: matching strings regex exact match (with a big thank-you to @Onyambu for the updated code).
I need to match strings exactly - even if there are special characters.
Note: apologies, this is the third question on this issue. I am nearly there, but I don't know how to handle special characters, and I am still upskilling on manipulating strings in R.
UPDATED FOR CLARITY:
I have a table of match words / strings like this:
codes <- structure(
  list(
    column1 = structure(
      c(2L, 3L, NA),
      .Label = c("", "4+", "4 +"),
      class = "factor"
    ),
    column2 = structure(
      c(1L, 3L, 2L),
      .Label = c("old", "the money", "work"),
      class = "factor"
    ),
    column3 = structure(
      c(3L, 2L, NA),
      .Label = c("", "wonderyears", "woke"),
      class = "factor"
    )
  ),
  row.names = c(NA, -3L),
  class = "data.frame"
)
And a dataset that has a column of strings. I want to see if any of the codes are included in each of the records in strings:
strings <- structure(
  list(
    SurveyID = structure(
      1:4,
      .Label = c("ID_1", "ID_2", "ID_3", "ID_4"),
      class = "factor"
    ),
    Open_comments = structure(
      c(2L, 4L, 3L, 1L),
      .Label = c(
        "I need to pick up some apples",
        "The system works",
        "Flag only if there is a 4 with a plus",
        "Show me the money"
      ),
      class = "factor"
    )
  ),
  class = "data.frame",
  row.names = c(NA, -4L)
)
I am currently matching the codes to the strings using the following code:
strings[names(codes)] <- lapply(codes, function(x)
  +(grepl(paste0("\\b", na.omit(x), "\\b", collapse = "|"), strings$Open_comments)))
Output:
  SurveyID                         Open_comments column1 column2 column3
1     ID_1                      The system works       0       0       0
2     ID_2                     Show me the money       0       1       0
3     ID_3 Flag only if there is a 4 with a plus       1       0       0
4     ID_4         I need to pick up some apples       0       0       0
Issue: I only want row 3 (ID_3) to be flagged if the string includes "4+" or "4 +", but it is being flagged anyway. Is there any way to match the codes exactly?
Upvotes: 1
Views: 297
Reputation: 887223
We can escape the + so that it is evaluated literally:
+(grepl(paste0( "(", gsub("\\+", "\\\\+", na.omit(codes$column1)), ")",
collapse="|"), strings$Open_comments))
#[1] 0 0 0 0
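To see why the escape matters, here is a minimal sketch (the variable names are only illustrative) comparing the unescaped and escaped forms of the "4 +" code against the ID_3 comment from the question. Unescaped, the + acts as a quantifier on the preceding space rather than as a character to match.

```r
comment <- "Flag only if there is a 4 with a plus"  # ID_3 from the question

# Unescaped: "+" is a quantifier, so "4 +" reads as "4, then one or more spaces"
unescaped <- grepl("\\b4 +\\b", comment)

# Escaped: "\\+" stands for a literal plus sign, which this comment lacks
escaped <- grepl("\\b4 \\+", comment)

c(unescaped = unescaped, escaped = escaped)
```

The unescaped pattern matches the "4 " in "4 with", which is why the row was flagged; the escaped pattern does not match.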
If we use a string that contains 4+, it is picked up:
+(grepl(paste0( "(", gsub("\\+", "\\\\+", na.omit(codes$column1)), ")",
collapse="|"), "Flag only if there is a 4+ with a plus"))
#[1] 1
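If word boundaries are not needed, an alternative that avoids escaping altogether is fixed = TRUE, which makes grepl treat the pattern as a plain string. A rough sketch follows; the third comment is a made-up example, and the codes are typed in by hand rather than taken from the codes data frame. Because fixed matching has no alternation, each code is tested separately and the results are combined.

```r
codes_col1 <- c("4+", "4 +")            # the non-missing values of codes$column1
comments <- c("The system works",
              "Flag only if there is a 4 with a plus",
              "Rated 4+ overall")       # made-up comment for illustration

# One grepl call per code; sapply returns a comments-by-codes logical matrix
hits <- sapply(codes_col1, function(p) grepl(p, comments, fixed = TRUE))

# A comment is flagged if at least one code occurs in it
flag <- +(rowSums(hits) > 0)
flag
```

Note that this is plain substring matching, so a code such as "work" would also be found inside "works".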
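And for multiple columns, apply the same escaping to each column of codes:
# Note: the trailing \\b directly after an escaped + only matches when a
# word character follows the +, so "4+ " (plus, then a space) is not matched
sapply(codes, function(x)
  +(grepl(paste0("\\b(", gsub("\\+", "\\\\+", na.omit(x)), ")\\b", collapse = "|"),
          strings$Open_comments)))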
# column1 column2 column3
#[1,] 0 0 0
#[2,] 0 1 0
#[3,] 0 0 0
#[4,] 0 0 0
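One caveat with \\b: a word boundary needs a word character on one side, so a code ending in + followed by a space is not matched by the \\b version. A possible workaround, sketched here under assumptions, is to escape every regex metacharacter and use lookarounds with perl = TRUE in place of \\b. The helpers escape_regex and match_codes are hypothetical names, not part of base R, and the last comment is made up.

```r
# Hypothetical helper: backslash-escape every regex metacharacter
escape_regex <- function(x) {
  gsub("([\\\\.^$|?*+()\\[\\]{}])", "\\\\\\1", as.character(x), perl = TRUE)
}

# Hypothetical helper: flag comments that contain any code as a whole token,
# i.e. not directly preceded or followed by a word character
match_codes <- function(code_col, comments) {
  pats <- escape_regex(na.omit(code_col))
  pattern <- paste0("(?<!\\w)(?:", pats, ")(?!\\w)", collapse = "|")
  +(grepl(pattern, comments, perl = TRUE))
}

comments <- c("The system works",
              "Show me the money",
              "Flag only if there is a 4 with a plus",
              "Rated 4+ overall")   # made-up comment for illustration

col1 <- match_codes(c("4+", "4 +", NA), comments)
col2 <- match_codes(c("old", "work", "the money"), comments)
cbind(col1, col2)
```

With these inputs only the made-up "Rated 4+ overall" comment is flagged for the first set of codes, and only "Show me the money" for the second; "works" is still not treated as a match for "work". The same function could be applied over the columns of codes with sapply, as in the answer above.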
Upvotes: 2