Reputation: 4103
import re
import urllib.request
file_txt = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/data/1408597/0000930413-12-003922.txt")
pattern_item4= re.compile("(Item\\n*\s*4.*)Item\\n*\s*5")
print(re.search(pattern_item4, file_txt.read().decode()))
#Returns None
This regex returns what I want in Rubular, but it doesn't do what I expect in Python. Could anyone help me a bit with this? The intention of the regex is to extract everything between Item 4 and Item 5.
Thank you
Upvotes: 1
Views: 211
Reputation: 75232
Knowing where the newlines are doesn't help you locate the matches, so there's no need to match \n specifically; it's just another whitespace character. Try this:
r"(?s)Item\s+4\..*?(?=Item\s+5\.)"
(?s) enables the . to match newlines, so .*? consumes everything until the lookahead, (?=Item\s+5\.), spots the beginning of the next "Item" entry. If you wanted to iterate over all the Items, you could replace the 4 and 5 with \d+.
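A minimal sketch of how this pattern behaves, assuming a small made-up sample in place of the real filing. The `sample` text and the `|\Z` alternative in the generalized pattern are illustrative assumptions, not part of the pattern given in this answer; `\Z` is added only so the last section, which has no following "Item" header, is still captured.

```python
import re

# Hypothetical sample text standing in for the filing (not the real document).
sample = (
    "Item 4.\nPurpose of Transaction\nSome text here.\n"
    "Item\n5.\nInterest in Securities\nMore text.\n"
    "Item 6.\nContracts\n"
)

# Lazy match from "Item 4." up to, but not including, the "Item 5." header.
item4 = re.search(r"(?s)Item\s+4\..*?(?=Item\s+5\.)", sample)

# Generalized form: each "Item N." section runs until the next header.
# The "|\Z" branch is an assumption added so the final section also matches.
all_items = re.compile(r"(?s)Item\s+\d+\..*?(?=Item\s+\d+\.|\Z)")
sections = [m.group(0) for m in all_items.finditer(sample)]

print(item4.group(0))
print(len(sections))
```

Because `\s+` already covers newlines, the header is found whether "Item" and its number share a line or are split across two.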
Upvotes: 0
Reputation: 414315
You need the re.DOTALL flag; otherwise . doesn't match a newline. To match Item at the end of a line, you could use $ with the re.MULTILINE flag:
pattern = re.compile(r"(Item$\s*4.*)Item$\s*5", re.S | re.M)
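To see the effect of the two flags, here is a small sketch run against a made-up `sample` string (an assumption for illustration, in which "Item" ends a line and the number starts the next one). The same pattern text is searched with and without the flags.

```python
import re

# Hypothetical sample: "Item" sits at the end of a line, the number follows.
sample = "Item\n4. Purpose of Transaction\nDetails.\nItem\n5. Interest\n"

pattern_text = r"(Item$\s*4.*)Item$\s*5"

# Without flags, $ only matches at the very end of the text and . stops at
# newlines, so nothing is found.
without_flags = re.search(pattern_text, sample)

# With re.S (DOTALL) and re.M (MULTILINE), $ matches at each line end and
# . crosses newlines, so the Item 4 section is captured in group 1.
with_flags = re.search(pattern_text, sample, re.S | re.M)

print(without_flags)
print(repr(with_flags.group(1)))
```

Note that this form only matches when "Item" is the last thing on its line; a header written as "Item 4" on a single line would not be found by it.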
Upvotes: 1
Reputation: 48577
Try using raw strings:
re.compile(r"(Item\n*\s*4.*)Item\n*\s*5")
I would guess it has to do with your escaping of \n. But it's impossible to tell without knowing exactly what you're expecting it to match.
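As a rough illustration of how escaping differs between regular and raw string literals, the sketch below compares three spellings of the same fragment. The variable names are hypothetical and exist only for this example.

```python
import re

regular = "Item\\n*"      # regular literal: the regex engine receives Item\n*
raw = r"Item\n*"          # raw literal: the regex engine receives the same Item\n*
raw_double = r"Item\\n*"  # raw literal, doubled backslash: a literal "\" then n*

# The first two literals produce identical pattern text.
same_pattern = regular == raw

# \n* in the pattern matches an actual newline character.
matches_newline = re.fullmatch(raw, "Item\n") is not None

# The doubled backslash in a raw string matches a backslash character instead.
double_matches_newline = re.fullmatch(raw_double, "Item\n") is not None
double_matches_backslash = re.fullmatch(raw_double, "Item\\n") is not None

print(same_pattern, matches_newline)
print(double_matches_newline, double_matches_backslash)
```

So when converting a regular literal to a raw one, each doubled backslash needs to be reduced to a single one to keep the same regex.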
Upvotes: 1