Reputation: 39
When I try to run the code below, I get the following error. I'd much appreciate it if someone could point out what I did wrong. Thank you.
Traceback (most recent call last):
  File "web_crawler.py", line 26, in <module>
    links = get_all_links(page)
  File "web_crawler.py", line 14, in get_all_links
    url, endpos = get_next_target(page)
  File "web_crawler.py", line 2, in get_next_target
    start_link = page.find("<a href=")
TypeError: a bytes-like object is required, not 'str'
def get_next_target(page):
    start_link = page.find("<a href=")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    print(url)
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

import requests
url = 'https://en.wikipedia.org/wiki/Moon'
r = requests.get(url)
page = r.content
links = get_all_links(page)
Upvotes: 2
Views: 39
Reputation: 3297
response.content is the raw content of the response. It is not decoded in any way; it's just the raw bytes, which is why calling page.find with a str argument raises the TypeError.
What you want to use instead is the response.text attribute, which contains the decoded content as a string.
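To see the difference without hitting the network, here is a minimal sketch. The `raw` value is a made-up stand-in for what `r.content` would hold, and decoding as UTF-8 is an assumption on my part; `r.text` picks the encoding for you based on the response.

```python
# Hypothetical stand-in for r.content: a bytes object, not a str.
raw = b'<p><a href="https://example.com/a">A</a></p>'

# Searching bytes with a str pattern is what triggers the TypeError.
try:
    raw.find("<a href=")
    failed = False
except TypeError:
    failed = True

# Decoding gives a str, which is roughly what r.text hands you.
text = raw.decode("utf-8")  # assumed encoding for this sketch
start = text.find("<a href=")

print(failed, start)
```

If you really do want to keep working with bytes, searching with a bytes pattern such as `raw.find(b"<a href=")` also works, but then every URL you slice out is bytes too, so switching to `r.text` is usually simpler.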
(You probably also want to use an HTML parsing library like BeautifulSoup instead of your current page.find approach.)
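As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's `html.parser` rather than BeautifulSoup, only so that it runs without installing anything. The `LinkCollector` class and the `sample` string are made up for this example; `get_all_links` keeps the name from the question but takes an already decoded string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def get_all_links(page):
    # page must be a str, e.g. r.text rather than r.content.
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


# Made-up sample markup; note the second link has an extra attribute
# and single quotes, which the page.find approach would not handle.
sample = (
    '<p><a href="/wiki/Earth">Earth</a> and '
    "<a class=\"x\" href='/wiki/Sun'>Sun</a></p>"
)
print(get_all_links(sample))
```

With BeautifulSoup installed, the equivalent would be along the lines of `soup.find_all("a", href=True)` and reading each tag's `href`. Either way, a parser copes with attribute order and quoting style, which manual string searching does not.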
Upvotes: 3