Guillaume Beauchamp

Reputation: 21

Replace Nth occurrence of a word (substring) in a string in R, N is the value of an integer column

I want to find the Nth occurrence of a word in an utterance and put [brackets] around it. I tried various things; the closest I've gotten is with gsub, but I can't use {copy-1} as the repetition count in my regex. Any ideas? Can we put a variable in there? Thanks!

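library(tidyverse)  # str_count() from stringr, mutate() from dplyr, uncount() from tidyr

#creating my df
utterance <- c("we are not who we think we are", "they know who we are")
df <- data.frame(utterance)
df$occurences = str_count(df$utterance, "we")
df <- df %>% mutate(ID = row_number())
# one row per occurrence; copy numbers the occurrences within each utterance
df <- df %>% uncount(occurences) %>% group_by(ID) %>% mutate(copy = row_number())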

#this is my gsub
gsub("((?:we){copy-1}.*)we", "\\[we\\]", df$utterance) 

This is the result I want:

    utterance                         ID  copy
    <chr>                          <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1

Upvotes: 2

Views: 69

Answers (3)

langtang

Reputation: 24722

How about just this:

library(tidyverse)

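f <- function(s,c,target) {
 # start of the c-th match of target in s (NA if fewer than c matches, -1 if none)
 g = gregexpr(target,s)[[1]][c]
 if(is.na(g) || g<0) return(s)
 # use nchar(), not length(): length(target) is always 1 for a single string
 paste0(str_sub(s,1,g-1),"[",target,"]",str_sub(s,g+nchar(target)))
}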

df %>% rowwise() %>% mutate(utterance = f(utterance,copy, "we"))

Output:

  utterance                           ID  copy
  <chr>                            <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1

Note that this will also find targets that are not whole words. For example, the second occurrence of "we" in "we went where we went yesterday" is the first two letters of "went", not the second occurrence of the word "we". If you want to restrict to whole words, you can update the gregexpr() call to this:

 g = gregexpr(paste0("\\b",target, "\\b"),s)[[1]][c]
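To see the difference between the two patterns, here is a small base R sketch. The string `s` and the variable names are just for illustration and are not part of the answer's function:

```r
# Illustrative input (an assumption, not data from the question)
s <- "we went where we went yesterday"

# Plain pattern: also matches the "we" at the start of each "went"
substring_hits <- as.integer(gregexpr("we", s)[[1]])
substring_hits
# [1]  1  4 15 18

# Word-boundary pattern: matches only the standalone word "we"
word_hits <- as.integer(gregexpr("\\bwe\\b", s)[[1]])
word_hits
# [1]  1 15
```

With the plain pattern, asking for the 2nd occurrence would bracket the start of "went" (position 4); with the boundaries it picks the second standalone "we" (position 15).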

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

Here's a mixed approach using a number of additional packages:

library(data.table)
library(tibble)
library(dplyr)
library(tidyr)
df %>%
  rowid_to_column() %>%
  separate_rows(utterance, sep = " ") %>%
  group_by(rowid) %>%
  mutate(wordcount = ifelse(utterance == "we", rleid(rowid), NA), # simpler: wordcount = ifelse(utterance == "we", 1, NA)
         wordcount = cumsum(!is.na(wordcount))) %>% 
  mutate(utterance = ifelse(utterance == "we" & wordcount == copy, paste0("[", utterance, "]"), utterance)) %>% 
  summarise(utterance = paste0(utterance, collapse = " ")) %>%
  bind_cols(.,df[,2:3])
# A tibble: 4 × 4
  rowid utterance                           ID  copy
  <int> <chr>                            <int> <int>
1     1 [we] are not who we think we are     1     1
2     2 we are not who [we] think we are     1     2
3     3 we are not who we think [we] are     1     3
4     4 they know who [we] are               2     1

Upvotes: 0

Tim Biegeleisen

Reputation: 521259

Here is a string splitting approach. We can split the input string on "we", and then piece it back together, using [we] as the nth connector and the plain word for all the others.

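repn <- function(x, find, repl, n) {
    # split on whole-word matches of find; parts are the pieces between matches
    parts <- strsplit(x, paste0("\\b", find, "\\b"))[[1]]
    output <- paste0(
        # pieces before the nth match, rejoined with the original word
        paste0(parts[1:n], collapse=find),
        repl,
        # pieces after the nth match, rejoined with the original word
        paste0(parts[(n+1):length(parts)], collapse=find)
    )

    return(output)
}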

x <- "we are not who we think we are"
repn(x, "we", "[we]", 1)
repn(x, "we", "[we]", 2)
repn(x, "we", "[we]", 3)

[1] "[we] are not who we think we are"
[1] "we are not who [we] think we are"
[1] "we are not who we think [we] are"
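
One thing to watch with splitting: strsplit() does not return a trailing empty string, so when the word is the very last thing in the input there is no final piece left to join. A position-based variant avoids that case. The following is only a sketch in base R; `bracket_nth` is a hypothetical helper name, and it uses gregexpr() positions rather than the splitting shown above:

```r
# Hypothetical helper (not from the answers): bracket the n-th whole-word match
bracket_nth <- function(x, find, n) {
  m <- gregexpr(paste0("\\b", find, "\\b"), x)[[1]]
  # no match at all, or fewer than n matches: return the input unchanged
  if (m[1] == -1 || n > length(m)) return(x)
  start <- m[n]
  end <- start + attr(m, "match.length")[n] - 1
  paste0(substr(x, 1, start - 1), "[", substr(x, start, end), "]",
         substr(x, end + 1, nchar(x)))
}

# Illustrative inputs; the second string ends with the target word
utterance <- c("we are not who we think we are", "who are we")
copy <- c(3, 1)
result <- mapply(bracket_nth, utterance, "we", copy, USE.NAMES = FALSE)
result
# [1] "we are not who we think [we] are" "who are [we]"
```

mapply() recycles the single word "we" across rows, so the same call should work on the utterance and copy columns of a data frame.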

Upvotes: 0
