Sabor117

Reputation: 135

How to search for only a specific string within a variable in R

In my code I have a string variable (panel_name) which can take a number of different forms, along the lines of: CVD II or Onc, IR or CVD II, CVD III and so on. I also have a function which searches this variable for specific strings and, based on their presence, outputs other strings.

So, for example, I have:

if (grepl("CVD II", panel_name) == TRUE){

    panel_pref = ""
    panel = "CVD2"

  } else if (grepl("CVD III", panel_name) == TRUE){

    panel_pref = ""
    panel = "CVD3"

  }

The issue I am coming across is that the first test, for CVD II, also returns TRUE when panel_name is CVD III, because "CVD II" is a substring of "CVD III". This is not what I want.

My current solution is to just invert the above code, so it becomes:

if (grepl("CVD III", panel_name) == TRUE){

    panel_pref = ""
    panel = "CVD3"

  } else if (grepl("CVD II", panel_name) == TRUE){

    panel_pref = ""
    panel = "CVD2"

  }

But this feels a little messy, so I am wondering if there is a way to search for a string specifically within another string.

I can't use if x == y (for example) because the variable sometimes contains more than one of the "names" I am searching for, and grepl does not seem to allow exclusions.

Upvotes: 0

Views: 2377

Answers (2)

camille

Reputation: 16842

A couple of regex options to use in your if / else tests:

test_cases <- c("CVD II", "CVD III")

Is II found at the end of the string?

grepl("CVD II$", test_cases)
#> [1]  TRUE FALSE

Is II found at the boundary of a word?

grepl("CVD II\\b", test_cases)
#> [1]  TRUE FALSE

Is II found without being followed by another I? This uses a negative lookahead, which requires Perl-compatible regex (perl = TRUE).

grepl("CVD II(?!I)", test_cases, perl = TRUE)
#> [1]  TRUE FALSE
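Since the question mentions that panel_name can hold more than one name, it may help to see how the three patterns behave on longer strings. The sketch below uses hypothetical inputs modelled on the forms described in the question; they are assumptions, not data from the asker.

```r
# Hypothetical inputs based on the forms described in the question
panel_names <- c("CVD II or Onc", "Onc, IR or CVD II", "CVD II, CVD III", "CVD III")

# End-of-string anchor: matches only when "CVD II" is the last thing in the string
end_anchor <- grepl("CVD II$", panel_names)
end_anchor
#> [1] FALSE  TRUE FALSE FALSE

# Word boundary: matches "CVD II" wherever it is not followed by another word character
word_boundary <- grepl("CVD II\\b", panel_names)
word_boundary
#> [1]  TRUE  TRUE  TRUE FALSE

# Negative lookahead: matches "CVD II" wherever it is not followed by another "I"
lookahead <- grepl("CVD II(?!I)", panel_names, perl = TRUE)
lookahead
#> [1]  TRUE  TRUE  TRUE FALSE
```

On inputs like these, the word-boundary and lookahead patterns look like the safer choices when "CVD II" can appear anywhere in the string, while the $ anchor only helps when it comes last.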

Or you can skip the if else tests and use a vectorized search and paste. The stringi and stringr packages have several convenience functions.

If you don't expect I to show up otherwise, you can simply count occurrences of I and paste that to CVD.

paste0("CVD", stringi::stri_count_regex(test_cases, "I"))
#> [1] "CVD2" "CVD3"
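The caveat about I not showing up elsewhere matters for inputs like the ones in the question. As a rough illustration, the sketch below counts I with base R (used here only to avoid a package dependency; it is meant to mirror stringi::stri_count_regex) on a hypothetical input that also contains "IR".

```r
# Hypothetical inputs; the third one contains an extra "I" in "IR"
panel_names <- c("CVD II", "CVD III", "Onc, IR or CVD II")

# Count every "I" in each string (base R stand-in for stri_count_regex)
i_counts <- lengths(regmatches(panel_names, gregexpr("I", panel_names)))

naive_panels <- paste0("CVD", i_counts)
naive_panels
#> [1] "CVD2" "CVD3" "CVD3"
```

The third result is wrong because the I in "IR" is counted too, so this shortcut is probably only suitable when the strings are known to contain nothing but the CVD name.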

Or, a somewhat strange option: Your strings contain roman numerals. Extract the I strings that occur after CVD:

stringi::stri_extract_first_regex(test_cases, "(?<=CVD )(I+)")
#> [1] "II"  "III"

You could expand that for higher roman numerals by including something like ([IVX]+). Then convert them to roman numeral objects with utils::as.roman, then regular numeric objects, then paste.

paste0("CVD", 
       as.numeric(as.roman(stringi::stri_extract_first_regex(test_cases, "(?<=CVD )(I+)"))))
#> [1] "CVD2" "CVD3"
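A minimal sketch of the expansion to higher numerals mentioned above, assuming the numeral always follows "CVD " and only uses the letters I, V and X. The "CVD IV" input is hypothetical, and base R's regmatches is used in place of stringi purely to keep the example dependency-free; note that regmatches silently drops strings with no match, unlike stri_extract_first_regex, which returns NA.

```r
# Hypothetical inputs, including a numeral beyond III
test_cases <- c("CVD II", "CVD III", "Onc or CVD IV")

# Extract the Roman numeral that directly follows "CVD " (lookbehind needs perl = TRUE)
numerals <- regmatches(test_cases,
                       regexpr("(?<=CVD )[IVX]+", test_cases, perl = TRUE))

# Convert the Roman numerals to integers and paste them onto the prefix
panels <- paste0("CVD", as.integer(utils::as.roman(numerals)))
panels
#> [1] "CVD2" "CVD3" "CVD4"
```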

Upvotes: 2

Nate

Reputation: 364

Sabor117,

You should check out ?regexp and expand your use of the regular expressions described there. For example, if it's just about distinguishing "CVD II" from "CVD III", then you can indicate the end of the string with $, as below:

a <- "CVD III"
grepl(pattern = "CVD II$", x = a)
#> [1] FALSE

Depending on your situation, there could be much better solutions.

Also, if you are new to regular expressions, it helps to be able to experiment with wildcards and other regex syntax. I would point you to one of the regex resources out there. My personal favorite is https://regex101.com/

Upvotes: 1
