Reputation: 345
I have a problem with urllib: I can't seem to scrape my own local website. I can get it to print out the full contents of the page, but the regex (or something else) doesn't work. The output I get with the current code is just []. So I was wondering what I am doing wrong? I haven't used urllib in a while, so it is very possible I missed something obvious.
Python file:
import urllib
import re
htmlfile=urllib.urlopen('IP of server')
htmltext=htmlfile.read()
regex="<body>(.+?)</body>"
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
HTML file:
<html>
<body>
This is a basic HTML file to try to get my python file to work...
</body>
</html>
Thanks a bunch in advance!
Upvotes: 1
Views: 85
Reputation: 70732
A few things are wrong here. You need to enable the dotall modifier, which lets the dot match across newline sequences. The lines that compile your regex and call findall should be:
regex = "<body>(.+?)</body>"
pattern = re.compile(regex, re.DOTALL)
price = pattern.findall(htmltext)
This can be simplified as shown below; I would also recommend discarding the whitespace around the match result.
price = re.findall(r'(?s)<body>\s*(.+?)\s*</body>', htmltext)
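To make the difference concrete, here is a minimal sketch that runs the three variants against the HTML from the question. It assumes Python 3 and uses a hard-coded string (the name html_text is made up for illustration) in place of the urlopen result, so it runs without a server.

```python
import re

# Stand-in for the page from the question; in the real script this would be
# the decoded result of urlopen(...).read().
html_text = """<html>
<body>
This is a basic HTML file to try to get my python file to work...
</body>
</html>
"""

# Without DOTALL, "." stops at "\n", so the pattern cannot cross lines.
without_dotall = re.findall(r"<body>(.+?)</body>", html_text)

# With DOTALL, "." also matches "\n"; the capture keeps the surrounding newlines.
with_dotall = re.findall(r"<body>(.+?)</body>", html_text, re.DOTALL)

# Inline (?s) flag, with \s* on both sides to keep whitespace out of the capture.
trimmed = re.findall(r"(?s)<body>\s*(.+?)\s*</body>", html_text)

print(without_dotall)
print(with_dotall)
print(trimmed)
```

The first call reproduces the empty list from the question, the second returns the body text with a newline on each side, and the third returns only the sentence itself.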
For future reference, use a parser such as BeautifulSoup to extract the data instead of regular expressions.
Upvotes: 2
Reputation: 473873
Alternatively, and this should actually be preferred to the regex-based approach, use an HTML parser.
Example (using BeautifulSoup):
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <html>
... <body>
... This is a basic HTML file to try to get my python file to work...
... </body>
... </html>
... """
>>> soup = BeautifulSoup(data)
>>> print soup.body.get_text(strip=True)
This is a basic HTML file to try to get my python file to work...
Note how simple the code is, no "regex magic".
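If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration under assumptions: it targets Python 3, and BodyTextParser is a hypothetical helper class written for this example, not part of any library.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Hypothetical helper: collects the text found inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self.in_body:
            self.chunks.append(data)


data = """
<html>
<body>
This is a basic HTML file to try to get my python file to work...
</body>
</html>
"""

parser = BodyTextParser()
parser.feed(data)
parser.close()

# Join the pieces and drop the leading/trailing newlines.
body_text = "".join(parser.chunks).strip()
print(body_text)
```

It is more code than the BeautifulSoup version, but it still avoids the regex and needs no extra packages.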
Upvotes: 2
Reputation: 43166
The dot . does not match line breaks unless you set the dot-matches-all s modifier:
re.compile('<body>(.+?)</body>', re.DOTALL)
Upvotes: 1