user3818089

Reputation: 345

Why doesn't urllib work with a local website?

I can't seem to scrape my own local website with urllib. I can print the full contents of the page, but the regex (or something else) doesn't work: the output of the current code is just []. What am I doing wrong? I haven't used urllib in a while, so it's quite possible I missed something obvious. Python file:

import urllib
import re

htmlfile=urllib.urlopen('IP of server')
htmltext=htmlfile.read()
regex="<body>(.+?)</body>"
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price 

HTML file:

<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>

Thanks a bunch in advance!

Upvotes: 1

Views: 85

Answers (3)

hwnd

Reputation: 70732

A few things are wrong here. You need to enable the DOTALL flag, which makes the dot match newline characters as well. The lines that compile your regex and call findall should be:

regex = "<body>(.+?)</body>"
pattern = re.compile(regex, re.DOTALL)
price = pattern.findall(htmltext)

This can be simplified to a single call, as below, where (?s) is the inline form of re.DOTALL. I would also recommend discarding the surrounding whitespace from the match result.

price = re.findall(r'(?s)<body>\s*(.+?)\s*</body>', htmltext)
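The effect of the flag can be checked without a server. The following is a minimal sketch, written for Python 3 and assuming the page content is exactly the HTML file from the question; the names html_text, without_flag and with_flag are made up for illustration.

```python
import re

# Same markup as the HTML file in the question, held in a string
# so the example does not need a running server (an assumption).
html_text = """<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>
"""

# Without DOTALL, "." cannot match the newline right after <body>,
# so the pattern never reaches </body> and nothing is found.
without_flag = re.findall(r"<body>(.+?)</body>", html_text)

# With DOTALL, "." also matches newlines; the \s* parts keep the
# surrounding whitespace out of the captured group.
with_flag = re.findall(r"<body>\s*(.+?)\s*</body>", html_text, re.DOTALL)

print(without_flag)
print(with_flag)
```

Note that in Python 3, urllib.request.urlopen(...).read() returns bytes, so the content would need to be decoded (for example with .decode("utf-8")) before a str pattern like this is applied.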

For future reference, use a parser such as BeautifulSoup to extract the data instead of regular expressions.
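If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration for Python 3: BodyTextParser and body_text are hypothetical helper names, not part of any library, and the sample string is assumed to match the HTML file from the question.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collects text that appears between <body> and </body> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self.in_body:
            self.chunks.append(data)


def body_text(html_text):
    """Return the text inside <body>, with surrounding whitespace removed."""
    parser = BodyTextParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip()


sample = """<html>
    <body>
        This is a basic HTML file to try to get my python file to work...
    </body>
</html>
"""
print(body_text(sample))
```

Unlike the regex, this does not depend on how the markup is split across lines, although a dedicated library such as BeautifulSoup is usually more forgiving with malformed pages.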

Upvotes: 2

alecxe

Reputation: 473873

Alternatively (and this should actually be preferred over the regex-based approach), use an HTML parser.

Example (using BeautifulSoup):

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <html>
...     <body>
...         This is a basic HTML file to try to get my python file to work...
...     </body>
... </html>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> print soup.body.get_text(strip=True)
This is a basic HTML file to try to get my python file to work...

Note how simple the code is, no "regex magic".

Upvotes: 2

Aran-Fey

Reputation: 43166

The dot . does not match line breaks unless you set the dot-matches-all flag (the s modifier, re.DOTALL in Python):

re.compile('<body>(.+?)</body>', re.DOTALL)

Upvotes: 1
