sundar_ima

Reputation: 3910

How to extract text between matching strings including match strings and lines

I am working in Python to extract certain lines between matching strings. The lines come from a list that is itself generated dynamically by a separate Python function. The list I am working on looks like this:-

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

The output I want is similar to this:-

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output

As you can see, I want to extract the lines that start with line1 and end with line3 (up to the end of that line). The final output should include the matching words themselves (i.e. line1 and line3).

The code I have tried is:-

import re

# Convert list to string first
list_to_str = '\n'.join(sample_list)
# Get desired output
print(re.findall('\nline1(.*?)\nline2(.*?)\nline3($)', list_to_str, re.DOTALL))

This is what I get as output:-

[]

Any help is appreciated.

Edit1:- After some more work, this is the closest I have come to a solution:-

matches = re.findall(r"^line1(.*)\nline2(.*)\nline3(.*)$", list_to_str, re.MULTILINE)

for match in matches:
    print('\n'.join(match))

It gives me this output:-

 this line is the first line
 this line is second line to be included in output
 this is the third and it should also be included in output
 this may contain other strings as well
 this line is second line to be included in output...
 this is the third should also be included in output

The output is almost correct but it does not include the match text.

Upvotes: 1

Views: 93

Answers (2)

Cedric Zoppolo

Reputation: 4743

This may not be the sharpest way (you may want to use regular expressions), but it does output what you want:

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
output = []
line1 = ""
line2 = ""
line3 = ""
prevStart = ""
for text in sample_list:
    if prevStart == "":
        if text.startswith("line1"):
            prevStart = "line1"
            line1 = text
    elif prevStart == "line1":
        if text.startswith("line2"):
            prevStart = "line2"
            line2 = text
        elif text.startswith("line1"):
            line1 = text
            prevStart = "line1"
        else:
            prevStart = ""
    elif prevStart == "line2":
        if text.startswith("line3"):
            prevStart = ""
            line3 = text
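        elif text.startswith("line1"):
            # A line1 right after line2 starts a new sequence instead of being dropped
            line1 = text
            prevStart = "line1"
        else:
            prevStart = ""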
    if line1 != "" and line2 != "" and line3 != "":
        output.append(line1)
        output.append(line2)
        output.append(line3)
        line1 = ""
        line2 = ""
        line3 = ""

for line in output:
    print(line)

Output for this code is:

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output
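
For comparison, the regular-expression route mentioned above could look something like the sketch below. It is not the asker's final code. It assumes the list is joined with newlines, as in the question, and simply moves the literals line1, line2 and line3 inside the capture groups so that each captured line keeps its prefix. The names text, pattern and result are only illustrative.

```python
import re

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

# Join the list so that one pattern can span three consecutive lines
text = "\n".join(sample_list)

# The literals sit inside the groups, so every captured line keeps its prefix.
# re.MULTILINE lets ^ and $ match at the start and end of each line.
pattern = re.compile(r"^(line1.*)\n(line2.*)\n(line3.*)$", re.MULTILINE)

# findall returns one (line1, line2, line3) tuple per match; flatten them
result = [line for match in pattern.findall(text) for line in match]

print("\n".join(result))
```

As a side note, without re.MULTILINE the `$` in the question's first attempt only matches at the very end of the whole string, and `line3($)` also requires the line to end immediately after the word line3, which is why that attempt returned an empty list.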

Upvotes: 1

user557597

Reputation:

If you're looking for a sequence of lines 1, 2, and 3 with no duplicates, this pattern does it:

line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*

Explained

 line1 .* \s*             # line 1 plus newline(s)
 (?! \s | line [13] )     # Next cannot be line 1 or 3 (or whitespace)
 line2 .* \s*             # line 2 plus newline(s)
 (?! \s | line [12] )     # Next cannot be line 1 or 2 (or whitespace)
 line3 .*                 # line 3 

If you want to capture the line content, just put capture groups around (.*)
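
A minimal sketch of how this pattern might be applied from Python, assuming the list from the question is joined with newlines. The commented layout from the explanation is reused with re.VERBOSE, and the capture groups around .* are the optional addition just described. The variable names blocks and contents are only illustrative.

```python
import re

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

text = "\n".join(sample_list)

# re.VERBOSE ignores the layout whitespace and allows the inline comments
pattern = re.compile(r"""
    line1 (.*) \s*            # line 1 plus newline(s)
    (?! \s | line[13] )       # next cannot be line 1 or 3 (or whitespace)
    line2 (.*) \s*            # line 2 plus newline(s)
    (?! \s | line[12] )       # next cannot be line 1 or 2 (or whitespace)
    line3 (.*)                # line 3
""", re.VERBOSE)

blocks = []    # whole matched text, including the line1/line2/line3 words
contents = []  # only what the three (.*) groups captured
for m in pattern.finditer(text):
    blocks.append(m.group(0))
    contents.append(m.groups())

print("\n".join(blocks))
```

Here m.group(0) is the full three-line match, so it already includes the match words. The groups hold only the remainder of each line, which starts with the space that follows the word.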

Upvotes: 2
