Eric Nguyen

Reputation: 39

TypeError while crawling a web page in Python

When I run the code below, I get the following error. I'd much appreciate it if someone could point out what I did wrong. Thank you.

Traceback (most recent call last):
  File "web_crawler.py", line 26, in <module>
    links = get_all_links(page)
  File "web_crawler.py", line 14, in get_all_links
    url, endpos = get_next_target(page)
  File "web_crawler.py", line 2, in get_next_target
    start_link = page.find("<a href=")
TypeError: a bytes-like object is required, not 'str'

def get_next_target(page):
    start_link = page.find("<a href=")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"',start_link)
    end_quote = page.find('"',start_quote+1)
    url = page[start_quote+1:end_quote]
    print(url)
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

import requests
url='https://en.wikipedia.org/wiki/Moon'
r = requests.get(url)
page = r.content
links = get_all_links(page)

Upvotes: 2

Views: 39

Answers (1)

Azsgy

Reputation: 3297

response.content is the raw body of the response. It is not decoded in any way; it is just raw bytes, so calling page.find with a str argument raises the TypeError you are seeing.

What you want to use instead is the response.text attribute, which contains the decoded content as a string.
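To make the difference concrete, here is a minimal sketch that reproduces the error without any network access. The `raw` literal is a made-up stand-in for `r.content`, and decoding it as UTF-8 is an assumption for this sample; with a real response, `r.text` picks the encoding for you (from the response headers, or a guess when they don't say).

```python
# Hypothetical stand-in for r.content: a bytes object, as requests returns it.
raw = b'<p><a href="https://example.org/">link</a></p>'

# Searching bytes with a str pattern fails, just like in the traceback.
try:
    raw.find("<a href=")
    failed = False
except TypeError:
    failed = True

# Decoding first gives a str, which is roughly what r.text hands you.
# UTF-8 is assumed here only because the sample above is plain ASCII.
text = raw.decode("utf-8")
start_in_text = text.find("<a href=")

# Alternatively, keep the bytes and search with a bytes pattern.
start_in_raw = raw.find(b"<a href=")
```

In the question's script, that amounts to changing `page = r.content` to `page = r.text`.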

(You probably also want to use an HTML parsing library like BeautifulSoup instead of your current page.find approach.)
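As a rough illustration of the parser-based approach, the sketch below uses the standard library's `html.parser` rather than BeautifulSoup, purely so that it runs without installing anything. The `LinkCollector` class and the `sample` string are made up for this example; `get_all_links` reuses the name from the question but takes an already decoded string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def get_all_links(page):
    """Return all link targets found in `page`, which must be a str."""
    collector = LinkCollector()
    collector.feed(page)
    return collector.links


# Made-up sample input; with requests you would pass r.text instead.
sample = (
    '<p><a href="/wiki/Earth">Earth</a> and '
    "<a class=\"x\" href='/wiki/Sun'>Sun</a></p>"
)
links = get_all_links(sample)
```

Unlike searching for the literal text `<a href=`, a parser also finds links whose `href` is not the first attribute or is written with single quotes, as in the second link of the sample.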

Upvotes: 3
