panman

Reputation: 1341

R: Index only the first occurrence of a pattern after another pattern

I have a vector of strings like this (part of a much larger one):

a <- c("My string",
       "characters",
       "sentence",
       "text.",
       "My string word sentence word.",
       "Other thing word sentence characters.",
       "My string word sentence numbers.",
       "Other thing",
       "word.",
       "sentence",
       "text.",
       "Other thing word. characters sentence.",
       "Different string word text.",
       "Different string.",
       "word.",
       "sentence.",
       "My string",
       "word",
       "sentence",
       "things.",
       "My string word sentence blah.")

As you can see, the vector contains several expressions. Some sit in a single element; others are split across multiple elements (which is fine). Note also that some elements contain more than one period. What I want is to extract the expressions that start with My string and end with a period, either in the same element (if the entire expression is in a single string) or at the end of the last element of the expression that My string starts.

The way I imagine this is to first index all elements containing My string:

> b <- grep(pattern = "My string", x = a, fixed = TRUE)
> b
[1]  1  5  7 17 21

Then, index all elements that end with a period:

> c <- grep(pattern = "\\.$", x = a)
> c
 [1]  4  5  6  7  9 11 12 13 14 15 16 20 21

Finally, obtain only the position of the FIRST period after each expression starting with My string (whether it sits in a single element or is spread across several). Then it would be easy to subset only the elements I need, to obtain something like this:

d <- c("My string",
       "characters",
       "sentence",
       "text.",
       "My string word sentence word.",
       "My string word sentence numbers.",
       "My string",
       "word",
       "sentence",
       "things.",
       "My string word sentence blah.")

Can someone help with the last step (obtaining only the position of the FIRST period after each expression starting with My string)?

Upvotes: 0

Views: 2411

Answers (2)

Benjamin

Reputation: 17279

Here's an alternative approach with dplyr:

library(dplyr)

a <- c("My string",
       "characters",
       "sentence",
       "text.",
       "My string word sentence word.",
       "Other thing word sentence characters.",
       "My string word sentence numbers.",
       "Other thing",
       "word.",
       "sentence",
       "text.",
       "Other thing word. characters sentence.",
       "Different string word text.",
       "Different string.",
       "word.",
       "sentence.",
       "My string",
       "word",
       "sentence",
       "things.",
       "My string word sentence blah.")

data.frame(a = a,
           stringsAsFactors = FALSE) %>%
  mutate(period = grepl("[.]", a), 
         sentence_id = lag(cumsum(period), default = 0)) %>%
  group_by(sentence_id) %>%
  mutate(retain = any(grepl("My string", a))) %>%
  ungroup() %>%
  filter(retain)

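The process is to flag the elements that contain a period and use the running count of those flags, lagged by one element, to mark where each new sentence starts. This gives us a sentence_id to group on, and then we only need to keep the groups in which some element contains "My string". Note that grepl("[.]", a) matches a period anywhere in an element, not only at the end; use "[.]$" instead if only a trailing period should close a sentence.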
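To make the sentence_id step easier to follow, here is a minimal base R sketch of the same idea on a short, made-up vector. The names x, has_start and kept are hypothetical, and c(0, head(cumsum(period), -1)) is used as a stand-in for lag(cumsum(period), default = 0), so this is an illustration under those assumptions rather than the dplyr pipeline itself.

```r
# Hypothetical short input, not the vector from the question
x <- c("My string", "text.", "Other thing.", "My string word.")

# TRUE where an element contains a period anywhere
period <- grepl(".", x, fixed = TRUE)

# Number of period-bearing elements seen BEFORE each element; a base R
# stand-in for dplyr's lag(cumsum(period), default = 0)
sentence_id <- c(0, head(cumsum(period), -1))

# Flag every element of a sentence in which some element contains "My string"
has_start <- grepl("My string", x, fixed = TRUE)
retain <- as.logical(ave(as.integer(has_start), sentence_id, FUN = max))

# Keep only the elements of the retained sentences
kept <- x[retain]
```

Because the count is lagged, the element that carries the period still belongs to the sentence it closes, and the next element opens a new sentence_id.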

Upvotes: 2

MrFlick

Reputation: 206242

I think something like this will do what you want:

b <- grep(pattern = "My string", x = a, fixed = TRUE)
c <- grep(pattern = "\\.$", x = a)

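# for each start index in b, take the first period index in c that is >= that start
# (assumes every "My string" is eventually followed by an element ending in a period)
e <- sapply(b, function(x) head(c[c>=x],1))

# expand each start:end pair into a run of indices and subset a
d <- a[unlist(Map(`:`, b,e))]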

#  [1] "My string"                       
#  [2] "characters"                      
#  [3] "sentence"                        
#  [4] "text."                           
#  [5] "My string word sentence word."   
#  [6] "My string word sentence numbers."
#  [7] "My string"                       
#  [8] "word"                            
#  [9] "sentence"                        
# [10] "things."                         
# [11] "My string word sentence blah."
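If some "My string" might never be followed by a period, the lookup above would return an empty result for that start. One possible variation, sketched here on a short made-up vector, uses findInterval() to find the first end index at or after each start and yields NA when there is none. The names x, starts, ends, first_end and complete are hypothetical, and dropping unfinished expressions is an assumption about the desired behaviour.

```r
# Hypothetical input whose last "My string" is never closed by a period
x <- c("My string", "word", "text.",
       "Other thing.",
       "My string done.",
       "Other", "thing.",
       "My string", "unfinished")

starts <- grep("My string", x, fixed = TRUE)  # elements that open an expression
ends   <- grep("\\.$", x)                     # elements that end with a period

# findInterval(starts - 1, ends) counts the end indices strictly below each
# start, so adding 1 points at the first end index >= that start.
# Indexing past the end of `ends` gives NA, i.e. no closing period exists.
first_end <- ends[findInterval(starts - 1, ends) + 1]

# Assumption: expressions without a closing period are dropped
complete <- !is.na(first_end)

# Expand each start:end pair into a run of indices and subset x
idx <- unlist(Map(seq, starts[complete], first_end[complete]))
d <- x[idx]
```

findInterval() needs its second argument sorted in increasing order, which grep() guarantees here because it returns indices in ascending order.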

Upvotes: 1
