Keva161

Reputation: 2683

Extract text between two pieces of text

I'm trying to use Python to extract the text between the headers below:

@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext

The exact text of @HEADER1 and @othertext might change over time, so I need the solution to be dynamic.

Also, HEADER2 is a word that starts with an '@'. So is there a startswith function I can use? Or a regular expression?

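Something like:

capturing = False
for line in file:  # `file` is an already-open file object
    line = line.strip()
    if line == '@HEADER1':
        capturing = True  # start printing from the next line
        continue
    if line == '@othertext':
        break  # stop at the closing header
    if capturing:
        print(line)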

Upvotes: 0

Views: 2466

Answers (4)

AHR

Reputation: 99

In such cases I use the str.partition() method:

text_to_extract = "@HEADER1\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\nExtractMe\n@othertext"
extracted = text_to_extract.partition('@HEADER1')[2].partition('@othertext')[0]
print(extracted)

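Output (the newlines that follow @HEADER1 and precede @othertext are kept, so a blank line is printed before and after the text):

ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe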
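
If the header names change over time, the same partition() idea can be wrapped in a small helper that takes both markers as arguments. This is only a sketch: the name text_between is made up for illustration, and it assumes each marker occurs once and in that order.

```python
def text_between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Hypothetical helper, not part of the original answer. Returns an empty
    string if `start` is missing; if `end` is missing, everything after
    `start` is returned.
    """
    _, _, rest = text.partition(start)
    body, _, _ = rest.partition(end)
    # strip() drops the newlines that sat next to the two markers
    return body.strip()


sample = "@HEADER1\nExtractMe\nExtractMe\nExtractMe\n@othertext"
extracted = text_between(sample, "@HEADER1", "@othertext")
print(extracted)
```

Calling strip() on the result is a design choice: it removes the blank line before and after the text, but it would also remove any intentional leading or trailing whitespace in the extracted block.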

Upvotes: 0

Arount

Reputation: 10403

This does the job:

import re

string = """@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext"""

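# The capturing group makes re.split keep the delimiters, so the text
# between the two headers ends up at index 2 of the result.
print('"{}"'.format(re.split(r'(@HEADER1[\n\r]|[\n\r]@othertext)', string)[2]))

Output: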

"ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe"

Upvotes: 5

Mohammad Yusuf

Reputation: 17054

Are you looking for something like this?

import re

string = """@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext
@HEADER2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
@othertext"""

for a in re.findall(r'@\w+(?:\r\n|\r|\n)(.*?)@\w+(?:\r\n|\r|\n)?', string, re.DOTALL):
    print(a)

Output:

ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe

ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
ExtractMe2
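
When there are several blocks, it can help to keep track of which header each block belongs to. The sketch below is one possible variation on the findall approach above, not the answerer's code: the name blocks_by_header is illustrative, and it assumes every block opens with an "@name" line, contains at least one line, and is closed by the next line that starts with "@".

```python
import re

# Assumes each block opens with an "@name" line and is closed by the next
# line starting with "@"; that closing line is consumed, not reused as an opener.
BLOCK = re.compile(r'^@(\w+)\r?\n(.*?)\r?\n@\w+', re.DOTALL | re.MULTILINE)


def blocks_by_header(text):
    """Map each opening header name to the list of lines that follow it."""
    return {m.group(1): m.group(2).splitlines() for m in BLOCK.finditer(text)}


sample = """@HEADER1
ExtractMe
ExtractMe
@othertext
@HEADER2
ExtractMe2
@othertext"""

blocks = blocks_by_header(sample)
print(blocks["HEADER1"])
print(blocks["HEADER2"])
```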

Upvotes: 2

Karim Tabet

Reputation: 1847

Without re:

string = """@HEADER1
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
ExtractMe
@othertext"""

You can play around with str.find inside a string slice, like so:

print(string[string.find("\n") + 1:string.find("\n@")])

Or you can turn the string into a list, take the elements you want, and join them back together, like so:

lines = string.split("\n")
lines = lines[1:-1]  # drop the first and last lines (the two headers)
print("\n".join(lines))
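
Since both markers are simply lines that start with '@', str.startswith() is enough when reading a file line by line, and no header name has to be hard-coded. A rough sketch, assuming the first '@' line opens the block and the next '@' line closes it; lines_between_headers is an illustrative name, and io.StringIO stands in for a real file here.

```python
import io


def lines_between_headers(lines):
    """Yield the lines between the first '@' line and the next '@' line."""
    capturing = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("@"):
            if capturing:
                return  # second '@' line: the block is finished
            capturing = True  # first '@' line: start capturing after it
            continue
        if capturing:
            yield line


# io.StringIO stands in for a real file opened with open(...)
fake_file = io.StringIO("@HEADER1\nExtractMe\nExtractMe\n@othertext\nignored\n")
result = list(lines_between_headers(fake_file))
print(result)
```

Because the function is a generator, it stops reading as soon as the closing header is seen, which matters for large files.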

Upvotes: 0
