user2628665

Reputation: 13

Finding big string sequence between two keywords within multiple lines

I have a file with the format of

sjaskdjajldlj_abc:  
cdf_asjdl_dlsf1:  
    dfsflks %jdkeajd  
sdjfls:  
    adkfld  %dk_.(%sfj)sdaj, %kjdflajfs  
    afjdfj _ajhfkdjf  
    zjddjh -15afjkkd  
    xyz  

and I want to find the text between the string _abc: in the first line and xyz in the last line. I have already tried

print re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)

But I got null.

Upvotes: 2

Views: 1615

Answers (3)

Blckknght

Reputation: 104752

It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".

To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
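A quick way to see the difference, as a minimal sketch (the sample strings here are made up purely for illustration):

```python
import re

# "ab*" means: an "a" followed by zero or more "b" characters.
star_only = re.findall(r"ab*", "a ab abbb")

# ".*" means: zero or more of any character (except a newline, by default).
dot_star = re.search(r"x(.*)y", "x-123-y").group(1)

# A "*" with nothing before it has nothing to repeat, so it is rejected.
try:
    re.compile(r"*abc")
    leading_star_ok = True
except re.error:
    leading_star_ok = False

print(star_only)
print(dot_star)
print(leading_star_ok)
```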

So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.

The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . pattern matches any character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
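The effect of the flag can be checked with a short sketch like this one (the three-line text is an invented stand-in for the real file contents):

```python
import re

text = "start_abc:\n  middle line\n  xyz\n"

# Without DOTALL, "." stops at the first newline, so the match cannot span lines.
no_flag = re.search(r"_abc:(.*)xyz", text)

# With DOTALL, "." also matches "\n", so the match can cross line boundaries.
with_flag = re.search(r"_abc:(.*)xyz", text, re.DOTALL)

print(no_flag)
print(repr(with_flag.group(1)))
```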

So, here's how you could turn your current code into a working system:

import re

def find_between(prefix, suffix, text):
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None # or perhaps raise an exception instead

I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
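For example, here is how the function might be called on a shortened version of the sample from the question. The function is repeated (slightly condensed) so the snippet runs on its own, and `sample` is only an abbreviated stand-in for the real file contents:

```python
import re

def find_between(prefix, suffix, text):
    # Escape the literal markers; ".*" with DOTALL spans the lines in between.
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group() if result else None

sample = (
    "sjaskdjajldlj_abc:\n"
    "cdf_asjdl_dlsf1:\n"
    "    dfsflks %jdkeajd\n"
    "    xyz\n"
)

matched = find_between("_abc:", "xyz", sample)
missing = find_between("no_such_prefix", "xyz", sample)

print(matched)
print(missing)
```

If only the text strictly between the two markers is wanted, wrapping the .* in parentheses and returning result.group(1) instead should do it.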

Upvotes: 0

Guru

Reputation: 17004

If I understood the requirement correctly:

import re

# line is assumed to hold the whole multi-line text, not a single line
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))

Use re.DOTALL which will enable . to match a newline character as well.

Upvotes: 2

llb

Reputation: 1741

You called re.escape on a pattern that contains special characters. That escapes them so they are matched literally, so there's no way it will work.

>>> re.escape("*_abc:")
'\\*_abc\\:'

This will match the actual phrase *_abc:, but that's not what you want.

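Taking the re.escape calls out is the first step. Note that a bare * with nothing before it is still not a valid pattern, so each * also needs a preceding . (as in .*) before it will work more or less correctly.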
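A small sketch of the distinction, assuming the goal is to keep the literal markers escaped while writing the wildcard part as real regex syntax (the sample text is invented for illustration):

```python
import re

text = "name_abc:\n  payload\n  xyz\n"

# Escaping the whole string turns "*" into a literal asterisk, so nothing matches.
escaped_all = re.search(re.escape("*_abc:"), text)

# Escape only the literal markers and write the wildcard as ".*" yourself.
pattern = re.escape("_abc:") + "(.*)" + re.escape("xyz")
partial = re.search(pattern, text, re.DOTALL)

print(escaped_all)
print(repr(partial.group(1)))
```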

Upvotes: 0
