Vinicius Mesel

Reputation: 157

Crawler doesn't run because of error in htmlfile = urllib.request.urlopen(urls[i])

I'm trying to write a web crawler: the user lists websites in websites.txt, and the Python code reads the URLs one by one and fetches each page's title.

import urllib.request
import re

i=0

regex = "<title>(.+?)</title>"
pattern = re.compile(regex)

txtfl = open('websites.txt')
webpgsinfile = txtfl.readlines()
urls = webpgsinfile

while i< len(urls):
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    print(htmltext)
    titles = re.findall(pattern,htmltext)
    print(titles)
    i+=1

But I'm getting this error:

Traceback (most recent call last):
  File "C:\Users\Vinicius\Documents\GitHub\python-crawler\scrapper-2-0.py", line 17, in <module>
    titles = re.findall(pattern,htmltext)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

Upvotes: 0

Views: 220

Answers (1)

Martijn Pieters

Reputation: 1124248

Either decode the downloaded HTML to unicode text, or use a b'...' bytes regular expression:

regex = b"<title>(.+?)</title>"

or:

htmltext = htmlfile.read().decode(htmlfile.info().get_param('charset', 'utf8'))
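To make the two options concrete without touching the network, here is a minimal sketch. The `html_bytes` value is a made-up stand-in for what `htmlfile.read()` returns, and the variable names are only illustrative:

```python
import re

# Hypothetical sample standing in for the bytes returned by htmlfile.read().
html_bytes = b"<html><head><title>Example page</title></head></html>"

# Option 1: a bytes pattern matches bytes and yields bytes results.
bytes_pattern = re.compile(b"<title>(.+?)</title>")
titles_as_bytes = bytes_pattern.findall(html_bytes)

# Option 2: decode first, then match with an ordinary str pattern.
# 'utf8' is assumed here; a real page may declare a different charset.
str_pattern = re.compile("<title>(.+?)</title>")
titles_as_text = str_pattern.findall(html_bytes.decode("utf8"))

# Mixing a str pattern with bytes input raises the TypeError from the question.
try:
    str_pattern.findall(html_bytes)
    mixed_error = None
except TypeError as exc:
    mixed_error = str(exc)
```

The rule is simply that the pattern and the searched object must be the same type. Decoding is usually the more convenient choice, because the titles then come back as text rather than bytes.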

However, you are using a regular expression, and matching HTML with regular expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:

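import urllib.request

from bs4 import BeautifulSoup

# url is one entry from websites.txt, stripped of its trailing newline
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response.read(), 'html.parser', from_encoding=response.info().get_param('charset'))
title = soup.find('title').text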

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
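As a rough illustration of the nesting problem, the sketch below runs a non-greedy pattern over a made-up snippet with nested div tags, then does the same job with the standard library's `html.parser` (used here only so the example needs no third-party install; it is not the BeautifulSoup approach above). The `DivText` class and the sample markup are hypothetical:

```python
import re
from html.parser import HTMLParser

# Hypothetical markup with one div nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern stops at the first </div>, which closes the inner tag,
# so the captured text is cut short and still contains raw markup.
regex_result = re.findall(r"<div>(.+?)</div>", html)


class DivText(HTMLParser):
    """Collect all text found inside div elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while at least one div is open.
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
parser_result = "".join(parser.chunks)
```

Here `regex_result` is `['outer <div>inner']`, while the parser recovers `'outer inner tail'`. A regular expression has no notion of which closing tag belongs to which opening tag, and that is the part a parser handles for you.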

Upvotes: 3
