Reputation: 13
I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc:
in the first line and xyz
in the last line.
I have already tried print
re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null
.
Upvotes: 2
Views: 1615
Reputation: 104752
It sounds like you have a misunderstanding about what the *
symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine *
with .
, which matches any single character (almost, more on this later). The pattern .*
matches any string of zero or more characters.
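To see the difference concretely, here is a minimal sketch. The sample strings are made up for illustration and are not from the question.

```python
import re

# "ab*" means "a" followed by zero or more "b" characters,
# not "ab followed by anything".
star_only = re.fullmatch(r"ab*", "abbb")
star_mismatch = re.fullmatch(r"ab*", "abxyz")

# "a.*" means "a" followed by zero or more of any character
# (any character except a newline, by default).
dot_star = re.fullmatch(r"a.*", "abxyz")

# A pattern that begins with a bare "*" has nothing to repeat,
# so the re module rejects it when compiling.
try:
    re.compile("*_abc:")
    bare_star_ok = True
except re.error:
    bare_star_ok = False

print(star_only is not None)      # True
print(star_mismatch is not None)  # False
print(dot_star is not None)       # True
print(bare_star_ok)               # False
```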
So, you could change your pattern to be .*abc(.*)xyz
and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .*
is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc
prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the .
pattern matches any single character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL
(or its shorter spelling, re.S
) as a third argument to re.findall
or re.search
. That flag tells the regular expression system to allow the .
pattern to match any character including newlines.
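A small sketch of the effect of the flag, assuming a made-up multi-line string shaped roughly like the file in the question:

```python
import re

# Hypothetical sample text; only its shape mirrors the question's file.
text = "header_abc:\nfirst line\nsecond line\nxyz\n"

# Without re.DOTALL, "." cannot cross the newline after "_abc:",
# so the pattern never reaches "xyz" and there is no match.
without_flag = re.search(r"_abc:(.*)xyz", text)

# With re.DOTALL, "." also matches newlines, so the group can span lines.
with_flag = re.search(r"_abc:(.*)xyz", text, re.DOTALL)

print(without_flag)              # None
print(repr(with_flag.group(1)))  # '\nfirst line\nsecond line\n'
```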
So, here's how you could turn your current code into a working system:
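import re

def find_between(prefix, suffix, text):
    # Escape the literal prefix and suffix; only the ".*" in between is a wildcard.
    pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
    # re.DOTALL lets "." match newlines, so the match can span several lines.
    result = re.search(pattern, text, re.DOTALL)
    if result:
        return result.group()
    else:
        return None  # or perhaps raise an exception instead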
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.
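If only the text between the two markers is wanted, a capturing group can be added. The following is a sketch of that variant, not part of the function above: the name find_inner and the sample string are hypothetical, and the non-greedy (.*?) is an extra choice so that the match stops at the first suffix rather than the last one.

```python
import re

def find_inner(prefix, suffix, text):
    """Return the text strictly between prefix and suffix, or None."""
    # (.*?) is non-greedy: it stops at the first suffix after the prefix.
    pattern = r"{}(.*?){}".format(re.escape(prefix), re.escape(suffix))
    result = re.search(pattern, text, re.DOTALL)
    return result.group(1) if result else None

# Hypothetical sample shaped like the question's file.
sample = "name_abc:\nline one\nline two\nxyz\n"
inner = find_inner("_abc:", "xyz", sample)
print(repr(inner))  # '\nline one\nline two\n'
```

The surrounding newlines are part of the captured text; calling .strip() on the result would remove them if they are not wanted.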
Upvotes: 0
Reputation: 17004
If I understood the requirement correctly:
a1 = re.search(r'_abc(.*)xyz', line, re.DOTALL)
print(a1.group(1))
Use re.DOTALL which will enable . to match a newline character as well.
Upvotes: 2
Reputation: 1741
You used re.escape
on your pattern when it contains special characters, so there's no way it will work.
>>> re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:
, but that's not what you want.
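Just take the re.escape calls out and it should work more or less correctly, provided each bare * is also replaced with .* (a pattern that starts with * has nothing to repeat and is not a valid regular expression).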
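A short sketch of what escaping does here. The test strings are invented for illustration, and the last pattern is one possible fix that keeps re.escape on the fixed parts only.

```python
import re

escaped = re.escape("*_abc:")

# The escaped pattern matches only the literal characters "*_abc:".
literal_hit = re.search(escaped, "prefix *_abc: rest")

# The question's first line contains no literal "*", so nothing is found.
no_hit = re.search(escaped, "sjaskdjajldlj_abc:")

# Escaping only the fixed text and leaving the wildcard unescaped does work.
fixed = re.escape("_abc:") + "(.*)" + re.escape("xyz")
between = re.search(fixed, "x_abc:\nmiddle\nxyz", re.DOTALL)

print(literal_hit is not None)  # True
print(no_hit is not None)       # False
print(repr(between.group(1)))   # '\nmiddle\n'
```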
Upvotes: 0