Check if any of multiple values in a string is within a numerical range R

Question

I have the following dummy data frame:

df <- structure(list(ref = structure(1:7, .Label = c("a", "b", "c", 
"d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), 
    result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", 
    "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

Which looks like this:

ref  gene  result
a    gyrA  S83L
b    gyrA  S83L, D87G
c    gyrA  V196A, M248V, E678D
d    gyrA  S83L
e    gyrA  S83L, D678E, D741E
f    parC  T765E
g    parC  S479T

I want to check whether the numerical value in each entry of the "result" column (the number between the two letters) falls within a specific range, 67-106, but only when the column "gene" == gyrA. Every number in each cell of the "result" column needs to be checked. The new column result_pos should be 1 if any of the numbers in the cell is within the specified range. I tried the following:

df %>%
   mutate(gyrA_pos = ifelse(gene == "gyrA", gsub("[[:alpha:]]", "", result), NA),
   result_pos = ifelse(gene == "gyrA" & gyrA_pos %in% as.character(seq(from = 67, to = 106)) == TRUE, 1, 0))

This works, but only for entries that contain a single value. I also find it tedious to have to create a column with the letters stripped before matching. I want to end up with this:
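To make the failure concrete, here is a minimal sketch on two of the values from the data above (the variable names `result` and `stripped` are only for illustration). Stripping the letters from a multi-value cell leaves a single string such as "83, 87", which is not itself one of the strings "67" to "106", so `%in%` reports no match:

```r
# Two entries from the example data: one single-value, one multi-value
result <- c("S83L", "S83L, D87G")

# Remove every letter, as in the attempt above
stripped <- gsub("[[:alpha:]]", "", result)
stripped
#> [1] "83"     "83, 87"

# The whole string is compared, so "83, 87" does not match "83" or "87"
stripped %in% as.character(67:106)
#> [1]  TRUE FALSE
```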

ref  gene  result                 result_pos
a    gyrA  S83L                   1
b    gyrA  S83L, D87G             1
c    gyrA  V196A, M248V, E678D    0
d    gyrA  S83L                   1
e    gyrA  S83L, D678E, D741E     1
f    parC  T765E                  NA
g    parC  S479T                  NA

Calum You · Accepted Answer

Here's one way. You can use str_extract_all to get all the numbers in a result, not just the first, convert them with map(as.integer), and then use map_lgl with any to check whether any of the numbers are in the specified range. The last two steps set result_pos to NA for rows where gene is not gyrA and convert the logical values to integers.

library(tidyverse)
df <- structure(list(ref = structure(1:7, .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))

df %>%
  mutate(
    result_pos = result %>%
      str_extract_all("\\d+") %>%
      map(as.integer) %>%
      map_lgl(~ any(.x >= 67L & .x <= 106L)),
    result_pos = if_else(gene != "gyrA", NA, result_pos),
    result_pos = as.integer(result_pos)
  )
#>   ref gene              result result_pos
#> 1   a gyrA                S83L          1
#> 2   b gyrA          S83L, D87G          1
#> 3   c gyrA V196A, M248V, E678D          0
#> 4   d gyrA                S83L          1
#> 5   e gyrA  S83L, D678E, D741E          1
#> 6   f parC               T765E         NA
#> 7   g parC               S479T         NA

Created on 2018-09-04 by the reprex package (v0.2.0).
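If you would rather avoid the tidyverse dependency, the same idea can be sketched in base R with gregexpr and regmatches. This is only an illustrative alternative, not part of the answer above: the helper name `in_range_any` and its `lower`/`upper` arguments are made up for this example, and the small `result` and `gene` vectors are a subset of the question's data.

```r
# Hypothetical helper: TRUE if any number in each string lies in [lower, upper]
in_range_any <- function(x, lower = 67L, upper = 106L) {
  x <- as.character(x)  # the column is a factor in the example data
  # Pull out every run of digits from each string (a list, one element per string)
  nums <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(nums, function(n) {
    n <- as.integer(n)
    any(n >= lower & n <= upper)
  }, logical(1))
}

result <- c("S83L", "S83L, D87G", "V196A, M248V, E678D", "T765E")
gene   <- c("gyrA", "gyrA", "gyrA", "parC")

# 1/0 for gyrA rows, NA for every other gene
result_pos <- ifelse(gene == "gyrA", as.integer(in_range_any(result)), NA_integer_)
result_pos
#> [1]  1  1  0 NA
```

Applied to the full data frame, this would be along the lines of `df$result_pos <- ifelse(df$gene == "gyrA", as.integer(in_range_any(df$result)), NA_integer_)`.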
