Reputation: 139
I have a small problem. I have text in this format:
A.1 Goals
Section 1: Blah Blah Blah
Random sentence A. Random sentence.
Section 2: Blah Blah Blah
Random sentence A.
Random sentence.
A.2 description
I want to obtain output of:
A.1 Goals
Section 1: Blah Blah Blah
Section 2: Blah Blah Blah
A.2 description
So basically, how can I obtain any string that is repeated more than once and followed by a varying number (that is, any pattern with the same string and different numbers)?
Upvotes: 2
Views: 178
Reputation: 11128
You can try this; however, I am not sure it gives exactly the output you want:
string <- c("Section 1: Blah Blah Blah","Random sentence A. Random sentence.",
"Section 2: Blah Blah Blah","Random sentence A.",
"Random sentence.")
grep("(\\w+)\\s+\\1\\s+\\1",string, value=TRUE)
Logic: The word is wrapped in parentheses to capture it, and \\1 then refers back to the captured word to detect the repetition. Using \\1 twice means the word must appear three times in a row (the captured word plus two repeats).
I am assuming the same structure throughout: each word must be followed by whitespace and then the same word again.
Output:
[1] "Section 1: Blah Blah Blah" "Section 2: Blah Blah Blah"
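To make the role of each backreference concrete, here is a minimal sketch. The test strings in x are hypothetical and not part of the original data; the only point is to show how the number of \\1 references changes what counts as a repetition.

```r
# Hypothetical test strings, chosen only to illustrate the backreference logic
x <- c("Blah", "Blah Blah", "Blah Blah Blah", "Blah Foo Blah")

# One backreference: the captured word must appear at least twice in a row
twice <- grepl("(\\w+)\\s+\\1", x)

# Two backreferences: the captured word must appear at least three times in a row
thrice <- grepl("(\\w+)\\s+\\1\\s+\\1", x)

print(data.frame(x, twice, thrice))
```

Only "Blah Blah Blah" satisfies the pattern with two backreferences, which is why that pattern picks out the "Section" lines in the sample data.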
Added after the OP's request:
With invert = TRUE in grep, you can invert the matching and get the lines that do not match:
grep("(\\w+)\\s+\\1\\s+\\1",string, value=TRUE,invert = TRUE)
The regex above then returns:
#[1] "Random sentence A. Random sentence."
#[2] "Random sentence A."
#[3] "Random sentence."
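If both the matching and the non-matching lines are needed, one possible variation (not part of the original answer) is to call grepl once and reuse the logical vector, instead of calling grep twice. The names is_repeated, matched and unmatched are made up for this sketch.

```r
string <- c("Section 1: Blah Blah Blah", "Random sentence A. Random sentence.",
            "Section 2: Blah Blah Blah", "Random sentence A.",
            "Random sentence.")

# grepl returns one TRUE/FALSE per element rather than the matching values
is_repeated <- grepl("(\\w+)\\s+\\1\\s+\\1", string)

matched   <- string[is_repeated]   # same result as grep(..., value = TRUE)
unmatched <- string[!is_repeated]  # same result as grep(..., value = TRUE, invert = TRUE)

print(matched)
print(unmatched)
```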
Upvotes: 2
Reputation: 887118
We can use grep after reading the file with readLines. Here, we match either the letter "A" followed by a . and one or more digits (\\d+), or (|) a line that starts with "Section" (^Section), followed by some characters (.*) and a repeated word with optional trailing whitespace ((\\w+\\s*)\\1, where \\1 is the backreference to the captured group).
lines <- readLines("file.txt") # read the text from the file
# Only the repeated word is captured, so that \\1 refers to it
out <- grep("^A\\.\\d+|^Section.*\\b(\\w+\\s*)\\1", lines, value = TRUE)
cat(out, sep = "\n\n")
#A.1 Goals
#Section 1: Blah Blah Blah
#Section 2: Blah Blah Blah
#A.2 description
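For anyone who wants to try this without creating file.txt, here is a self-contained sketch that puts the question's sample text into a character vector. That vector standing in for the file is an assumption of this sketch. The pattern keeps a single capture group, so \\1 unambiguously refers to the repeated word.

```r
# Sample text from the question, standing in for readLines("file.txt")
lines <- c("A.1 Goals",
           "Section 1: Blah Blah Blah",
           "Random sentence A. Random sentence.",
           "Section 2: Blah Blah Blah",
           "Random sentence A.",
           "Random sentence.",
           "A.2 description")

# Either "A." plus digits at the start of the line, or a line starting with
# "Section" that contains a word immediately repeated
pattern <- "^A\\.\\d+|^Section.*\\b(\\w+\\s*)\\1"

out <- grep(pattern, lines, value = TRUE)
print(out)
```

The "Random sentence" lines are dropped because they start with neither "A." plus a number nor "Section".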
Upvotes: 3