hyhno01
hyhno01

Reputation: 177

Create a Column Based on Another Column Using grepl

Let's consider a df with two columns word and stem. I want to create a new column that checks whether the value in stem is entailed in word and whether it is preceded or succeeded by some more characters. The final result should look like this:

WORD     STEM     NEW
rerun    run      prefixed
runner   run      suffixed
run      run      none
...      ...      ...

And below you can see my code so far. However, it does not work because the grepl expression is applied on all rows of the df. Anyways, I think it should make clear my idea.

df$new <- ifelse(grepl(paste0('.+', df$stem, '.+'), df$word), 'both',
             ifelse(grepl(paste0(df$stem, '.+'), df$word), 'suffixed',
                ifelse(grepl(paste0('.+', df$stem), df$word), 'prefixed','none')))

Upvotes: 3

Views: 921

Answers (3)

Ric S
Ric S

Reputation: 9247

You can create the new column like this

df$new <- ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
                 ifelse(startsWith(df$word, df$stem), 'suffixed',
                        ifelse(endsWith(df$word, df$stem), 'prefixed',
                               'both')))

Or, in you are in a dplyr pipeline and you want to avoid all the annoying df$

df %>%
  mutate(new = ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
                      ifelse(startsWith(df$word, df$stem), 'suffixed',
                             ifelse(endsWith(df$word, df$stem), 'prefixed',
                                    'both'))))

Output

#       word stem     new1
# 1    rerun  run prefixed
# 2   runner  run suffixed
# 3      run  run     none
# 4    aruna  run     both

Upvotes: 1

GKi
GKi

Reputation: 39657

You can use mapply to use grepl per line like:

ifelse(mapply(grepl, paste0(".+", x$STEM, ".+"), x$WORD), "both",
ifelse(mapply(grepl, paste0(x$STEM, ".+"), x$WORD), "suffixed",
ifelse(mapply(grepl, paste0(".+", x$STEM), x$WORD), "prefixed", "none")))
#"prefixed" "suffixed"     "none" 

Or using startsWith and endsWith and use subseting form vector:

c("none", "both", "prefixed", "suffixed")[1 + (1 + startsWith(x$WORD, x$STEM) +
 2*endsWith(x$WORD, x$STEM)) * (nchar(x$WORD) > nchar(x$STEM) &
 mapply(grepl, x$STEM, x$WORD))]
#[1] "suffixed" "prefixed" "none"    

Upvotes: 2

Ian Campbell
Ian Campbell

Reputation: 24790

Here's an approach with str_locate from stringr and dplyr:

library(dplyr)
library(stringr)
data %>%
  mutate_at(vars(WORD,STEM), as.character) %>%
  mutate(NEW = 
         case_when(str_locate(WORD,STEM)[,"start"] > 1 &
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "both",
                   str_locate(WORD,STEM)[,"start"] > 1 ~ "prefixed",
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "suffixed",
                   TRUE ~ "none"))
    WORD STEM      NEW
1  rerun  run prefixed
2 runner  run suffixed
3    run  run     none

I added a line to convert WORD and STEM to character in case they were factors.

Upvotes: 1

Related Questions