Reputation: 1041
I have the following dummy data frame:
structure(list(ref = structure(1:7, .Label = c("a", "b", "c",
"d", "e", "f", "g"), class = "factor"), gene = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"),
result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T",
"S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"
), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
Which looks like this:
ref gene result
a   gyrA S83L
b   gyrA S83L, D87G
c   gyrA V196A, M248V, E678D
d   gyrA S83L
e   gyrA S83L, D678E, D741E
f   parC T765E
g   parC S479T
What I want to do is check whether the numerical value in the "result" column (the number between the two letters in each entry) falls within a specific range, 67-106, but only when the "gene" column is gyrA. Every number in each cell of the "result" column needs to be checked. The new column result_pos should be 1 if any of the numbers in the cell is within the range, 0 if none is, and NA for any other gene. I tried the following:
df %>%
  mutate(gyrA_pos = ifelse(gene == "gyrA", gsub("[[:alpha:]]", "", result), NA),
         result_pos = ifelse(gene == "gyrA" & gyrA_pos %in% as.character(seq(from = 67, to = 106)) == TRUE, 1, 0))
This works, but only for the entries with only one value. I also find it tedious to have to create a column with the letters stripped before matching. I want to end up with this:
ref gene result              result_pos
a   gyrA S83L                1
b   gyrA S83L, D87G          1
c   gyrA V196A, M248V, E678D 0
d   gyrA S83L                1
e   gyrA S83L, D678E, D741E  1
f   parC T765E               NA
g   parC S479T               NA
Upvotes: 2
Views: 460
Reputation: 26343
Here is a data.table option.
library(data.table)
setDT(DF) # DF holds the question's data frame; setDT() converts it by reference
DF[, `:=`(result = as.character(result), # coerce result to character
          result_pos = NA_integer_)]     # set result_pos to NA
DF[gene == 'gyrA', result_pos := {
  x <- lapply(strsplit(result, split = ","),
              gsub,
              pattern = "\\D+",
              replacement = "")
  as.integer(sapply(x, function(i)
    any(as.numeric(i) >= 67 & as.numeric(i) <= 106)))
}][]
# ref gene result result_pos
#1: a gyrA S83L 1
#2: b gyrA S83L, D87G 1
#3: c gyrA V196A, M248V, E678D 0
#4: d gyrA S83L 1
#5: e gyrA S83L, D678E, D741E 1
#6: f parC T765E NA
#7: g parC S479T NA
The idea is to strsplit() the result column, remove the letters, check for your condition and return the outcome as an integer, only for the rows where gene == 'gyrA'.
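To see what each of those steps produces, here is a small base R sketch run on a plain character vector instead of the data.table column. The vector result and the intermediate names parts, digits and in_range are made up for illustration; the two strings are copied from the question's data.

```r
# Hypothetical input: two cells copied from the question's result column
result <- c("S83L, D87G", "V196A, M248V, E678D")

# 1. Split each cell on commas; this gives a list with one character vector per cell
parts <- strsplit(result, split = ",")

# 2. Strip every non-digit character (letters and spaces) from each piece
digits <- lapply(parts, gsub, pattern = "\\D+", replacement = "")

# 3. Convert to numbers and test whether any of them falls within 67-106
in_range <- sapply(digits, function(i) {
  n <- as.numeric(i)
  any(n >= 67 & n <= 106)
})

# 4. Return 1/0 instead of TRUE/FALSE
as.integer(in_range)
#> [1] 1 0
```

The first cell contains 83 and 87, so it scores 1; the second contains only 196, 248 and 678, so it scores 0.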
Upvotes: 1
Reputation: 15072
Here's one way. You can use str_extract_all() to get all the numbers in a result, not just the first, and then use map() with any() to check if any of the numbers are in the specified range. The end is just to insert NA where desired and convert to integers.
library(tidyverse)
df <- structure(list(ref = structure(1:7, .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))
df %>%
  mutate(
    result_pos = result %>%
      str_extract_all("\\d+") %>%
      map(as.integer) %>%
      map_lgl(~ any(.x >= 67L & .x <= 106L)),
    result_pos = if_else(gene != "gyrA", NA, result_pos),
    result_pos = as.integer(result_pos)
  )
#> ref gene result result_pos
#> 1 a gyrA S83L 1
#> 2 b gyrA S83L, D87G 1
#> 3 c gyrA V196A, M248V, E678D 0
#> 4 d gyrA S83L 1
#> 5 e gyrA S83L, D678E, D741E 1
#> 6 f parC T765E NA
#> 7 g parC S479T NA
Created on 2018-09-04 by the reprex package (v0.2.0).
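If the pipeline inside mutate() is hard to follow, the same stages can be run one at a time on a plain character vector. This is only a sketch: it assumes stringr and purrr are installed, and the names result, nums and hit are invented for the example, with the strings copied from the question's data.

```r
library(stringr)
library(purrr)

# Hypothetical input: three cells copied from the question's result column
result <- c("S83L", "S83L, D678E, D741E", "T765E")

# Extract every run of digits; this gives a list with one character vector per cell
nums <- str_extract_all(result, "\\d+")

# Convert each character vector to integers
nums <- map(nums, as.integer)

# One logical per cell: TRUE if any number lies within 67-106
hit <- map_lgl(nums, ~ any(.x >= 67L & .x <= 106L))
hit
#> [1]  TRUE  TRUE FALSE
```

Note that the third cell evaluates to FALSE here because the gene is not taken into account at this stage; in the answer above, it is the later if_else() step that turns the non-gyrA rows into NA.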
Upvotes: 2