RUser

Reputation: 598

R Return position of word in a string

I have data like this:

data <- data.frame(
  text = c(
    "PARACETAMOL/CODEINE",
    "PSEUDOEPH/PARACET/CODEINE",
    "PARACETAMOL/CODEINE/DOXYLAMINE",
    "CODEINE & ASPIRIN",
    "CODEINE/PARACETAMOL",
    "TEST"
  ),
  stringsAsFactors = FALSE
)

For each row, I want to return the position at which CODEINE occurs among the components, i.e.:

text                             position
PARACETAMOL/CODEINE                     2
PSEUDOEPH/PARACET/CODEINE               3
PARACETAMOL/CODEINE/DOXYLAMINE          2
CODEINE & ASPIRIN                       1
CODEINE/PARACETAMOL                     1
TEST                                    0

I would prefer a dplyr solution that can run over hundreds of rows.

I looked at various other Stack Overflow answers, but I can't get any of them working. They mostly deal with the index of a word within the string rather than its position relative to other words. One idea would be to tokenise and then count positions with something like tidytext, but I think there could be an easier way. I suspect it is some nifty regex.

UPDATED

I neglected to include an element that does not contain CODEINE; both answers error out on it.

Any help would be greatly appreciated.

Upvotes: 1

Views: 55

Answers (2)

Ronak Shah

Reputation: 388817

There may be a direct regex solution for this, but here is one way: split each string into words and return the index of the first word equal to "CODEINE" (0 if it does not occur).

library(dplyr)

data %>%
  mutate(text1 = stringr::str_extract_all(text, "\\w+"), 
         position = purrr::map_int(text1, 
                     ~max(which(.x == "CODEINE")[1], 0L, na.rm = TRUE))) %>%
  select(-text1)

#                            text position
#1            PARACETAMOL/CODEINE        2
#2      PSEUDOEPH/PARACET/CODEINE        3
#3 PARACETAMOL/CODEINE/DOXYLAMINE        2
#4              CODEINE & ASPIRIN        1
#5            CODEINE/PARACETAMOL        1
#6                           TEST        0

The same logic in base R:

sapply(strsplit(data$text, "/|\\&"), function(x) 
         max(which(trimws(x) == "CODEINE")[1], 0, na.rm = TRUE))
#[1] 2 3 2 1 1 0
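
If you need this for more than one search term, the same split-and-match logic can be wrapped in a small function. This is only a sketch: the name `word_position` and its `sep` argument are made up for illustration, and it assumes the components are separated by "/" or "&" with optional spaces around them. It uses `match()`, which returns the index of the first exact match or the `nomatch` value when there is none.

```r
# Sample strings copied from the question
text <- c(
  "PARACETAMOL/CODEINE",
  "PSEUDOEPH/PARACET/CODEINE",
  "PARACETAMOL/CODEINE/DOXYLAMINE",
  "CODEINE & ASPIRIN",
  "CODEINE/PARACETAMOL",
  "TEST"
)

# Hypothetical helper: 1-based position of `word` among the components of
# each string, or 0 if the word is not present.
# Assumption: components are separated by "/" or "&", optionally padded
# with whitespace.
word_position <- function(text, word, sep = "\\s*[/&]\\s*") {
  tokens <- strsplit(trimws(text), sep)
  # match() gives the index of the first exact match, or `nomatch` if absent
  vapply(tokens, function(x) match(word, x, nomatch = 0L), integer(1))
}

word_position(text, "CODEINE")
#[1] 2 3 2 1 1 0
```

Because the function is vectorised over `text`, it should also work directly inside `mutate()`, e.g. `mutate(position = word_position(text, "CODEINE"))`.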

Upvotes: 3

dc37

Reputation: 16178

This is not the most straightforward solution, but you can use grep and strsplit. An ifelse statement handles the case where CODEINE is absent, and replace_na then fills in 0 for it.

Altogether, you can write something like:

library(dplyr)
library(tidyr)  # replace_na() comes from tidyr, not dplyr

data %>% rowwise() %>% 
  mutate(Position = replace_na(ifelse(is.null(grep("CODEINE", unlist(strsplit(text,"/|\\&")))),NA,
                           grep("CODEINE", unlist(strsplit(text,"/|\\&")))),0))


Source: local data frame [7 x 2]
Groups: <by row>

# A tibble: 7 x 2
  text                           Position
  <chr>                             <dbl>
1 PARACETAMOL/CODEINE                   2
2 PSEUDOEPH/PARACET/CODEINE             3
3 PARACETAMOL/CODEINE/DOXYLAMINE        2
4 CODEINE & ASPIRIN                     1
5 CODEINE/PARACETAMOL                   1
6 PARA & CODEINE                        2
7 TEST                                  0
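
One thing to keep in mind with this approach: `grep("CODEINE", ...)` matches any component that contains CODEINE as a substring, not only components that are exactly CODEINE. The sketch below illustrates the difference on made-up tokens (they are not in the question's data). Whether this matters depends on what your real data looks like.

```r
# Hypothetical tokens, not taken from the question's data
tokens <- c("DIHYDROCODEINE", "CODEINE")

# Substring match: both tokens contain "CODEINE"
substring_hits <- grep("CODEINE", tokens)
substring_hits
#[1] 1 2

# Anchored pattern: only the token that is exactly "CODEINE"
anchored_hits <- grep("^CODEINE$", trimws(tokens))
anchored_hits
#[1] 2

# Equality test: same result as the anchored pattern
exact_hits <- which(trimws(tokens) == "CODEINE")
exact_hits
#[1] 2
```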

Upvotes: 2
