Reputation: 598
I have data like this:
data <- data.frame(
  text = c(
    "PARACETAMOL/CODEINE",
    "PSEUDOEPH/PARACET/CODEINE",
    "PARACETAMOL/CODEINE/DOXYLAMINE",
    "CODEINE & ASPIRIN",
    "CODEINE/PARACETAMOL",
    "TEST"
  ),
  stringsAsFactors = FALSE
)
For each row, I want to return the position in which CODEINE occurs, i.e.:
text                            position
PARACETAMOL/CODEINE             2
PSEUDOEPH/PARACET/CODEINE       3
PARACETAMOL/CODEINE/DOXYLAMINE  2
CODEINE & ASPIRIN               1
CODEINE/PARACETAMOL             1
TEST                            0
I would prefer a dplyr solution that can run over hundreds of rows.
I looked at various other Stack Overflow answers, but I just can't seem to get it working. They mostly deal with character indexes of a word rather than its position relative to other words. One idea would be to tokenise and then count positions with something like tidytext, but I think there could be an easier way. I suspect it is some nifty regex.
UPDATE
I neglected to include a row that does not contain CODEINE; both answers error out on it.
Any help would be greatly appreciated.
Upvotes: 1
Views: 55
Reputation: 388817
Maybe there is a direct regex solution that will help you achieve that. Here is one way: split the string into words and return the index of the word that equals "CODEINE".
library(dplyr)
data %>%
  mutate(text1 = stringr::str_extract_all(text, "\\w+"),
         position = purrr::map_int(text1,
           ~max(which(.x == "CODEINE")[1], 0L, na.rm = TRUE))) %>%
  select(-text1)
#                            text position
#1            PARACETAMOL/CODEINE        2
#2      PSEUDOEPH/PARACET/CODEINE        3
#3 PARACETAMOL/CODEINE/DOXYLAMINE        2
#4              CODEINE & ASPIRIN        1
#5            CODEINE/PARACETAMOL        1
#6                           TEST        0
Using the same logic in base R, this can be done as:
sapply(strsplit(data$text, "/|\\&"), function(x)
  max(which(trimws(x) == "CODEINE")[1], 0, na.rm = TRUE))
#[1] 2 3 2 1 1 0
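For the direct regex route mentioned at the start of this answer, one possible sketch in base R is to locate CODEINE and count the separators to its left. The function name `codeine_position` and its `target` and `sep` parameters are made up for illustration and are not part of any package; the sketch assumes that `/` and `&` are the only separators.

```r
data <- data.frame(
  text = c("PARACETAMOL/CODEINE", "PSEUDOEPH/PARACET/CODEINE",
           "PARACETAMOL/CODEINE/DOXYLAMINE", "CODEINE & ASPIRIN",
           "CODEINE/PARACETAMOL", "TEST"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: position of `target` among separator-delimited words.
codeine_position <- function(text, target = "CODEINE", sep = "[/&]") {
  # Character offset of the first occurrence of the target (-1 when absent)
  start <- as.integer(regexpr(target, text, fixed = TRUE))
  # Text to the left of the target (empty string when absent or at offset 1)
  before <- substr(text, 1, start - 1)
  # Number of separators to the left of the target
  n_sep <- lengths(regmatches(before, gregexpr(sep, before)))
  # Word position is separators + 1; rows without the target get 0
  ifelse(start > 0, n_sep + 1L, 0L)
}

data$position <- codeine_position(data$text)
data$position
#[1] 2 3 2 1 1 0
```

Because the function is vectorised, it should also work inside `mutate(position = codeine_position(text))`. Note that it matches substrings, so an ingredient that merely contains the letters CODEINE would count as a hit, unlike the exact comparison with `==` above.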
Upvotes: 3
Reputation: 16178
Not the most straightforward solution, but you can use grep and strsplit. You can add an ifelse statement to test for the absence of a match and fill with 0 in that case.
Altogether, you can write something like:
library(dplyr)
library(tidyr)  # replace_na() comes from tidyr, not dplyr
data %>% rowwise() %>%
  mutate(Position = replace_na(
    ifelse(is.null(grep("CODEINE", unlist(strsplit(text, "/|\\&")))), NA,
           grep("CODEINE", unlist(strsplit(text, "/|\\&")))), 0))
Source: local data frame [7 x 2]
Groups: <by row>
# A tibble: 7 x 2
  text                           Position
  <chr>                             <dbl>
1 PARACETAMOL/CODEINE                   2
2 PSEUDOEPH/PARACET/CODEINE             3
3 PARACETAMOL/CODEINE/DOXYLAMINE        2
4 CODEINE & ASPIRIN                     1
5 CODEINE/PARACETAMOL                   1
6 PARA & CODEINE                        2
7 TEST                                  0
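The test-for-absence step can also be folded into the lookup itself: `match()` takes a `nomatch` argument, so a missing ingredient can come back as 0 without `ifelse` or `replace_na`. A minimal sketch, assuming `/` and `&` are the only separators; `word_position` is a hypothetical helper name, not an existing function.

```r
text <- c("PARACETAMOL/CODEINE", "PSEUDOEPH/PARACET/CODEINE",
          "PARACETAMOL/CODEINE/DOXYLAMINE", "CODEINE & ASPIRIN",
          "CODEINE/PARACETAMOL", "PARA & CODEINE", "TEST")

# Hypothetical helper: index of the word equal to `target`, 0 when absent.
word_position <- function(text, target = "CODEINE", sep = "[/&]") {
  vapply(
    strsplit(text, sep),
    # trimws() removes the spaces left around "&"; nomatch supplies the 0
    function(parts) match(target, trimws(parts), nomatch = 0L),
    integer(1)
  )
}

word_position(text)
#[1] 2 3 2 1 1 2 0
```

Unlike `grep`, `match()` compares whole words, so only an ingredient that is exactly CODEINE is counted.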
Upvotes: 2