Reputation: 57
I'm just starting to learn the Google Cloud Platform (Cloud Functions in particular). I wrote a simple Python scraper using BeautifulSoup, but it returns an error and I can't figure out why.
from bs4 import BeautifulSoup
import requests

def hello_world(request):
    """Responds to any HTTP request.
    Args:
        request (flask.Request): HTTP request object.
    Returns:
        The response text or any set of values that can be turned into a
        Response object using
        `make_response <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>`.
    """
    url = 'https://example.com/'
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title
    print(title)
    return title
When I print the title of the scraped page, it shows up in the logs fine. When I return the variable, though, the logs report "IndexError: list index out of range". When I return soup.prettify() instead, it also works fine.
This is the traceback that I get in the GCP logs:
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1953, in full_dispatch_request
    return self.finalize_request(rv)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 1968, in finalize_request
    response = self.make_response(rv)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/flask/app.py", line 2117, in make_response
    rv = self.response_class.force_type(rv, request.environ)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/werkzeug/wrappers/base_response.py", line 269, in force_type
    response = BaseResponse(*_run_wsgi_app(response, environ))
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/werkzeug/wrappers/base_response.py", line 26, in _run_wsgi_app
    return _run_wsgi_app(*args)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/werkzeug/test.py", line 1123, in run_wsgi_app
    return app_iter, response[0], Headers(response[1])
IndexError: list index out of range
Upvotes: 0
Views: 1075
Reputation: 281
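The problem may be caused by wrong indentation.
In any case, try this code, which may be easier to understand. Note that it returns soup.title.get_text(), which is a plain string, rather than soup.title itself, which is a BeautifulSoup Tag object: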
from bs4 import BeautifulSoup
import requests

url = 'https://stackoverflow.com'

def title_scraper(url):
    # Fetch the page, sending a Googlebot-style User-Agent header
    req = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"})
    soup = BeautifulSoup(req.content, 'html.parser')
    # get_text() returns the title as a plain string, not a Tag object
    return soup.title.get_text()

title = title_scraper(url)
print(title)
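Applied to the Cloud Function from the question, the same idea could look roughly like the sketch below. It rests on an assumption that the traceback seems to support but that the thread does not confirm: Flask cannot turn a BeautifulSoup Tag into a response (the trace shows it falling back to treating the return value as a WSGI app), whereas a plain string is returned fine, which would also explain why soup.prettify() works. The helper name extract_title and the timeout value are hypothetical and not part of the original code.

```python
from bs4 import BeautifulSoup
import requests


def extract_title(html):
    """Return the page title as a plain string ('' if there is no <title>)."""
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is a bs4 Tag (or None when the page has no <title>),
    # not a string, so convert it before handing it back to Flask.
    if soup.title is None:
        return ''
    return soup.title.get_text()


def hello_world(request):
    """HTTP Cloud Function entry point: returns the title of a fixed page."""
    url = 'https://example.com/'
    req = requests.get(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'},
        timeout=10,  # assumed value, so a slow site cannot hang the function
    )
    # Returning a str gives Flask something it can turn into a response.
    return extract_title(req.text)
```

Keeping the parsing in its own function means it can be checked locally with a literal HTML string, without deploying the function or making a network request. If the markup is wanted instead of just the text, str(soup.title) would also give Flask a string.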
Upvotes: 0