zsljulius

Reputation: 4103

Regex passes in Rubular but not in Python

import re
import urllib.request
file_txt = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/data/1408597/0000930413-12-003922.txt")
pattern_item4= re.compile("(Item\\n*\s*4.*)Item\\n*\s*5")
print(re.search(pattern_item4, file_txt.read().decode()))
#Returns None

This regex returns what I want in Rubular, but it doesn't do what I expect in Python. Could anyone help me a bit with this? The intention of the regex is to extract everything between Item 4 and Item 5.

Thank you


Upvotes: 1

Views: 211

Answers (3)

Alan Moore

Reputation: 75232

Knowing where the newlines are doesn't help you locate the matches, so there's no need to match \n specifically; it's just another whitespace character. Try this:

r"(?s)Item\s+4\..*?(?=Item\s+5\.)"

(?s) enables the . to match newlines, so .*? consumes everything until the lookahead - (?=Item\s+5\.) - spots the beginning of the next "Item" entry. If you wanted to iterate over all the Items, you could replace the 4 and 5 with \d+.
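
As a rough sketch of how this behaves, here is the pattern run against a short made-up string. The `sample` text is an assumption standing in for the real filing, and the `|\Z` alternative in the generalized version is an addition (not part of the pattern above) so that the last Item, which has no following Item, is still captured:

```python
import re

# Hypothetical stand-in for the filing text; the real document is much longer.
sample = (
    "Item 3. Legal.\n"
    "Item\n4. Mine safety\ndisclosures.\n"
    "Item 5. Market info.\n"
)

# (?s) lets . match newlines; the lazy .*? stops at the first lookahead hit.
pattern = re.compile(r"(?s)Item\s+4\..*?(?=Item\s+5\.)")
match = pattern.search(sample)
item4_text = match.group(0) if match else None
print(repr(item4_text))

# Generalized form with \d+; \Z (end of string) is an added assumption so the
# final Item is matched as well.
all_items = re.findall(r"(?s)Item\s+\d+\..*?(?=Item\s+\d+\.|\Z)", sample)
print(len(all_items))
```

Because the lookahead does not consume text, each match ends just before the next "Item" heading, which is what makes the `findall` version able to pick up consecutive entries.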

Upvotes: 0

jfs

Reputation: 414315

You need the re.DOTALL flag; otherwise . doesn't match a newline. To match Item at the end of a line, you could use $ with the re.MULTILINE flag:

pattern = re.compile(r"(Item$\s*4.*)Item$\s*5", re.S | re.M)
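
A small sketch of the difference the flags make, using a made-up `sample` string (an assumption, not the actual filing) in which "Item" sits at the end of a line:

```python
import re

# Hypothetical sample where each "Item" is followed by a line break.
sample = "Item\n4. Mine safety\ndisclosures.\nItem\n5. Market info.\n"

# Without re.DOTALL, .* cannot cross the line break inside Item 4.
no_flags = re.search(r"(Item\s*4.*)Item\s*5", sample)
print(no_flags)

# re.S (DOTALL) lets . match newlines; re.M (MULTILINE) lets $ match at
# the end of every line rather than only at the end of the string.
with_flags = re.search(r"(Item$\s*4.*)Item$\s*5", sample, re.S | re.M)
item4_text = with_flags.group(1) if with_flags else None
print(repr(item4_text))
```

Note that .* is greedy here, so if the text contained more than one "Item 5" heading, the capture would run up to the last one.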

Upvotes: 1

Falmarri

Reputation: 48577

Try using raw strings

re.compile(r"(Item\n*\s*4.*)Item\n*\s*5")

I would guess it has to do with your escaping of \n. But it's impossible to tell without knowing exactly what it is you're expecting that to match.
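
To see how the escaping interacts with raw strings, here is a minimal sketch with made-up test strings. In a raw string a doubled backslash reaches the regex engine as a literal backslash, so a pattern copied from a non-raw string needs its `\\n` reduced to `\n`:

```python
import re

non_raw = "Item\\n*"        # Python unescapes this; the regex engine sees Item\n*
raw_same = r"Item\n*"       # raw string carrying the same regex text
raw_doubled = r"Item\\n*"   # regex engine sees a literal backslash, then n*

same_text = non_raw == raw_same

# The single-backslash forms match "Item" followed by a real newline.
matches_newline = re.fullmatch(raw_same, "Item\n") is not None

# The doubled form does not; it matches a backslash character followed by n's.
doubled_matches_newline = re.fullmatch(raw_doubled, "Item\n") is not None
doubled_matches_backslash = re.fullmatch(raw_doubled, "Item\\nn") is not None

print(same_text, matches_newline, doubled_matches_newline, doubled_matches_backslash)
```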

Upvotes: 1
