Haakonkas

Reputation: 1041

Check if any of multiple values in a string is within a numerical range in R

I have the following dummy data frame:

structure(list(ref = structure(1:7, .Label = c("a", "b", "c", 
"d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), 
    result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", 
    "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

Which looks like this:

ref  gene  result
a    gyrA  S83L
b    gyrA  S83L, D87G
c    gyrA  V196A, M248V, E678D
d    gyrA  S83L
e    gyrA  S83L, D678E, D741E
f    parC  T765E
g    parC  S479T

What I want to do is check whether the numerical value in the column "result" (the number between the two letters in each entry) is within a specific range, 67-106, but only when the column "gene" == gyrA. This needs to be checked for every number in each cell of the "result" column. The new column result_pos should be 1 if any of the numbers in the cell is within the specified range, 0 if none is, and NA when the gene is not gyrA. I tried the following:

df %>%
   mutate(gyrA_pos = ifelse(gene == "gyrA", gsub("[[:alpha:]]", "", result), NA),
   result_pos = ifelse(gene == "gyrA" & gyrA_pos %in% as.character(seq(from = 67, to = 106)) == TRUE, 1, 0))

This works, but only for the entries with only one value. I also find it tedious to have to create a column with the letters stripped before matching. I want to end up with this:

ref  gene  result                 result_pos
a    gyrA  S83L                   1
b    gyrA  S83L, D87G             1
c    gyrA  V196A, M248V, E678D    0
d    gyrA  S83L                   1
e    gyrA  S83L, D678E, D741E     1
f    parC  T765E                  NA
g    parC  S479T                  NA

Upvotes: 2

Views: 460

Answers (2)

markus

Reputation: 26343

Here is a data.table option.

library(data.table)
setDT(DF)
DF[, `:=`(result = as.character(result), # coerce result to character
          result_pos = NA_integer_)] # set result_pos to NA 
DF[gene == 'gyrA', result_pos := {
  x <-
    lapply(strsplit(result, split = ","),
           gsub,
           pattern = "\\D+",
           replacement = "")
  as.integer(sapply(x, function(i)
    any(as.numeric(i) >= 67 & as.numeric(i) <= 106)))
}][]
#   ref gene              result result_pos
#1:   a gyrA                S83L          1
#2:   b gyrA          S83L, D87G          1
#3:   c gyrA V196A, M248V, E678D          0
#4:   d gyrA                S83L          1
#5:   e gyrA  S83L, D678E, D741E          1
#6:   f parC               T765E         NA
#7:   g parC               S479T         NA

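The idea is to split the column result on commas with strsplit, strip the non-digit characters, test each remaining number against the range, and return the outcome as an integer, all restricted to the rows where gene == 'gyrA'. Rows for other genes keep the NA that result_pos was initialised with. (DF here is the data frame from the question.)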
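For readers who do not use data.table, the same split, strip, and test idea can be sketched in base R. This is only an illustrative sketch, not part of the original answer: the helper name any_in_range and its lo/hi arguments are made up here, and the small data frame is a cut-down version of the question's data.

```r
# Hypothetical helper (not from the original answer):
# TRUE if any number embedded in a string falls within [lo, hi]
any_in_range <- function(x, lo, hi) {
  # split each string on commas
  parts <- strsplit(as.character(x), split = ",")
  # strip every non-digit character and convert what is left to numbers
  nums <- lapply(parts, function(p) as.numeric(gsub("\\D+", "", p)))
  # one logical per input string
  vapply(nums, function(n) any(n >= lo & n <= hi, na.rm = TRUE), logical(1))
}

# cut-down version of the question's data
df <- data.frame(
  gene = c("gyrA", "gyrA", "parC"),
  result = c("S83L, D87G", "V196A, M248V, E678D", "T765E"),
  stringsAsFactors = FALSE
)

# 1/0 for gyrA rows, NA for every other gene
df$result_pos <- ifelse(df$gene == "gyrA",
                        as.integer(any_in_range(df$result, 67, 106)),
                        NA_integer_)
df
```

Keeping the range check in a small function makes it easy to reuse with a different range for another gene, should that be needed.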

Upvotes: 1

Calum You

Reputation: 15072

Here's one way. You can use str_extract_all to get all the numbers in a result, not just the first, and then use map_lgl with any to check whether any of the numbers fall in the specified range. The last two steps just insert NA for the rows where gene is not gyrA and convert the logical values to integers.

library(tidyverse)
df <- structure(list(ref = structure(1:7, .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))

df %>%
  mutate(
    result_pos = result %>%
      str_extract_all("\\d+") %>%
      map(as.integer) %>%
      map_lgl(~ any(.x >= 67L & .x <= 106L)),
    result_pos = if_else(gene != "gyrA", NA, result_pos),
    result_pos = as.integer(result_pos)
  )
#>   ref gene              result result_pos
#> 1   a gyrA                S83L          1
#> 2   b gyrA          S83L, D87G          1
#> 3   c gyrA V196A, M248V, E678D          0
#> 4   d gyrA                S83L          1
#> 5   e gyrA  S83L, D678E, D741E          1
#> 6   f parC               T765E         NA
#> 7   g parC               S479T         NA

Created on 2018-09-04 by the reprex package (v0.2.0).

Upvotes: 2
