J.Doe

Reputation: 139

How to extract a specific string followed by any number?

I have a small problem. I have text in this format:

A.1 Goals

Section 1: Blah Blah Blah
Random sentence A. Random sentence.
Section 2: Blah Blah Blah
Random sentence A.
Random sentence.

A.2 description

I want to obtain this output:

A.1 Goals

Section 1: Blah Blah Blah

Section 2: Blah Blah Blah

A.2 description

So basically, how do I extract any string that is repeated more than once and followed by any number (that is, any pattern of the same string with varying numbers)?

Upvotes: 2

Views: 178

Answers (2)

PKumar

Reputation: 11128

You can try this, although I am not sure about the exact output you need:

string <- c("Section 1: Blah Blah Blah", "Random sentence A. Random sentence.",
            "Section 2: Blah Blah Blah", "Random sentence A.",
            "Random sentence.")

grep("(\\w+)\\s+\\1\\s+\\1", string, value = TRUE)

Logic: The word is wrapped in parentheses to capture it, and \\1 is a backreference that matches the same text again. Using \\1 twice after the group means the word must appear at least three times in a row, that is, more than twice.

I am assuming the same structure throughout: the repeated words are separated only by whitespace.
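To illustrate how the number of backreferences sets the required repetition count, here is a minimal sketch. The vector x is hypothetical sample data, not taken from the question, and perl = TRUE is my own choice so that the backreferences are handled by PCRE.

```r
# Hypothetical sample lines (not from the original question)
x <- c("Blah Blah Blah",   # word three times in a row
       "Blah Blah",        # word only twice in a row
       "Blah foo Blah")    # word repeats, but not consecutively

# One backreference: the captured word must occur twice in a row
twice <- grepl("(\\w+)\\s+\\1", x, perl = TRUE)

# Two backreferences: the captured word must occur three times in a row
thrice <- grepl("(\\w+)\\s+\\1\\s+\\1", x, perl = TRUE)

print(twice)   # TRUE  TRUE FALSE
print(thrice)  # TRUE FALSE FALSE
```

Note that without word boundaries (\\b) the group can also capture part of a word, so "the theme" would satisfy the one-backreference pattern.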

Output:

[1] "Section 1: Blah Blah Blah" "Section 2: Blah Blah Blah"

Added after the OP's request:

With invert = TRUE in grep, you can invert the match and return the elements that do not match the pattern:

grep("(\\w+)\\s+\\1\\s+\\1", string, value = TRUE, invert = TRUE)

The above call returns:

#[1] "Random sentence A. Random sentence."
#[2] "Random sentence A."                 
#[3] "Random sentence." 
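As a small cross-check of what invert = TRUE does, the sketch below compares it with negating the logical vector returned by grepl. The variable names pattern, via_invert and via_grepl are mine, and perl = TRUE is an assumption rather than part of the answer above.

```r
string <- c("Section 1: Blah Blah Blah", "Random sentence A. Random sentence.",
            "Section 2: Blah Blah Blah", "Random sentence A.",
            "Random sentence.")
pattern <- "(\\w+)\\s+\\1\\s+\\1"

# Elements that do NOT match, via grep's invert argument
via_invert <- grep(pattern, string, value = TRUE, invert = TRUE, perl = TRUE)

# The same selection, by negating grepl's logical result and subsetting
via_grepl <- string[!grepl(pattern, string, perl = TRUE)]

print(identical(via_invert, via_grepl))  # TRUE
```

The grepl form can be handy when the logical vector is needed again, for example to keep matching and non-matching lines side by side.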

Upvotes: 2

akrun

Reputation: 887118

We can use grep after reading the file with readLines. Here, we match either a line that starts with the letter "A" followed by a . and one or more digits (^A\\.\\d+), or (|) a line that starts with "Section" (^Section) followed by any characters (.*) and then a repeated word with optional trailing whitespace ((\\w+\\s*)\\1, where \\1 is the backreference to the captured group). The alternatives are left unparenthesised so that the repeated word is capture group 1.

out <- grep("^A\\.\\d+|^Section.*\\b(\\w+\\s*)\\1", lines, value = TRUE)
cat(out, sep = "\n\n")
#A.1 Goals

#Section 1: Blah Blah Blah

#Section 2: Blah Blah Blah

#A.2 description

data

lines <- readLines("file.txt") #reading from the file
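For readers without file.txt, here is a self-contained sketch of the same idea. The character vector lines is a stand-in for the readLines result, built from the text in the question. The two alternatives are tested separately (is_heading and is_section are hypothetical names) so that \\1 clearly refers to the repeated-word group. perl = TRUE is my assumption, not part of the answer.

```r
# Stand-in for readLines("file.txt"): the sample text from the question
lines <- c("A.1 Goals", "",
           "Section 1: Blah Blah Blah",
           "Random sentence A. Random sentence.",
           "Section 2: Blah Blah Blah",
           "Random sentence A.",
           "Random sentence.", "",
           "A.2 description")

# First alternative: line starts with "A." followed by one or more digits
is_heading <- grepl("^A\\.\\d+", lines, perl = TRUE)

# Second alternative: line starts with "Section" and contains a word
# that is immediately repeated (\\1 refers to the only capture group)
is_section <- grepl("^Section.*\\b(\\w+\\s*)\\1", lines, perl = TRUE)

out <- lines[is_heading | is_section]
cat(out, sep = "\n\n")
```

Splitting the pattern this way avoids any ambiguity about group numbering inside an alternation, at the cost of scanning the lines twice.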

Upvotes: 3
