eamon1234

Reputation: 1585

Python urlopen error with saved webpage

I have saved a webpage at C:\webpage.htm. I want to load it and analyse it with BeautifulSoup, but urllib2 won't open it.

from BeautifulSoup import BeautifulSoup
import urllib2

url="C:\webpage.htm"

page=urllib2.urlopen(url)

This raises the following error:

Traceback (most recent call last):
    page=urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 423, in _open
    'unknown_open', req)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1240, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: c>

How do I solve this, or is there another way to load the document into BeautifulSoup? I also tried saving it as a text document, but that raised this error:

'str' object has no attribute 'findall'

Upvotes: 0

Views: 1240

Answers (2)

MrPhilH

Reputation: 31

Since you are loading a file from your local machine, you do not need urllib2. Instead, you can use Python's standard file I/O functions: open(), read(), and close().

from BeautifulSoup import BeautifulSoup

# raw string, so the backslash is never read as an escape sequence
path = r"C:\webpage.htm"
f = open(path)
# read entire file as a string
page = f.read()
f.close()
soup = BeautifulSoup(page)
# etc...

Upvotes: 3

Silvester

Reputation: 516

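urlopen needs a URL with an explicit scheme. A bare Windows path is split at the first colon, so the drive letter is taken as the scheme, which is why the error reports "unknown url type: c". For a local file, use the file:// scheme: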

from BeautifulSoup import BeautifulSoup
import urllib2
url="file:///C:/webpage.htm"
page=urllib2.urlopen(url)
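
If you would rather not write the file:/// URL by hand, here is a rough sketch of the same idea. It assumes Python 3, where urllib2 became urllib.request, so it is not a drop-in for the Python 2.7 code above. The helper names scheme_of and read_local_page are made up for illustration, and urlparse is used only to show how a drive letter ends up being read as a scheme; urllib2 does its own splitting internally. A temporary file stands in for the saved page.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def scheme_of(location):
    """Return the URL scheme that urlparse detects in `location` (hypothetical helper)."""
    return urlparse(location).scheme


def read_local_page(path):
    """Read a local file through urlopen by converting its path to a file:// URL."""
    # as_uri() only accepts absolute paths, so resolve() first
    file_url = Path(path).resolve().as_uri()
    with urlopen(file_url) as response:
        return response.read()


# The text before the first colon is taken as the scheme
print(scheme_of(r"C:\webpage.htm"))          # the drive letter, lowercased
print(scheme_of("file:///C:/webpage.htm"))   # the file scheme

# A temporary file stands in for the saved page
with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "webpage.htm"
    saved.write_bytes(b"<html><title>saved page</title></html>")
    html = read_local_page(saved)

print(html)
```

The bytes returned by read_local_page can then be handed to BeautifulSoup the same way as the result of f.read().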

Upvotes: 3
