Reputation: 297
EDIT - this is a concurrent futures issue, not a BS4 issue. Concurrent futures was returning an empty list after retrieving data, which resulted in the NoneType error from BS4.
I'm attempting to scrape H1s from a list of URLs using Beautiful Soup, but I'm getting the error TypeError: 'str' object is not callable on one of the URLs.
If I print the output, I can see that 3 of the h1s were retrieved before the error.
If I remove .h1.text.strip() I get a different error, although curiously it prints the HTML 4 times, not 3.
E.g. I changed bsh_h1 = bsh.h1.text.strip() to bsh_h1 = bsh, and it ran four times and produced the error: TypeError: unhashable type: 'ResultSet'
I've tried adding a try/except block, but it seems to produce errors elsewhere in the code, so I feel I'm missing something fundamental. Maybe BS4 is returning NoneType and I need to skip it?
Here's my code.
import concurrent.futures
from bs4 import BeautifulSoup
from urllib.request import urlopen

CONNECTIONS = 1

archive_url_list = [
    "https://web.archive.org/web/20171220015929/http://www.manueldrivingschool.co.uk:80/prices.php",
    "https://web.archive.org/web/20160313085709/http://www.manueldrivingschool.co.uk/lessons_prices.php",
    "https://web.archive.org/web/20171220002420/http://www.manueldrivingschool.co.uk:80/prices",
    "https://web.archive.org/web/20201202094502/https://www.manueldrivingschool.co.uk/success",
]

archive_h1_list = []

def get_archive_h1(h1_url):
    html = urlopen(h1_url)
    bsh = BeautifulSoup(html.read(), 'lxml')
    bsh = bsh.h1.text.strip()
    return bsh.h1.text.strip()

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        for future in concurrent.futures.as_completed(f1):
            try:
                data = future.result()
                archive_h1_list.append(data)
            except Exception:
                archive_h1_list.append("No Data Received!")
                pass

if __name__ == '__main__':
    concurrent_calls()
    print(archive_h1_list)
PS: I'm using concurrent futures for multithreading. I initially thought that was the issue, but I'm leaning towards BS4 now.
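For reference on the concurrent futures side: executor.map() yields results directly, not Future objects, while concurrent.futures.as_completed() expects futures, so the two don't combine. A minimal, stdlib-only sketch of the submit() plus as_completed() pattern follows. Note that fake_fetch and collect are hypothetical names used only for illustration, and fake_fetch stands in for the real download-and-parse step so the example runs without network access.

```python
import concurrent.futures


def fake_fetch(url):
    # Hypothetical stand-in for the real download-and-parse function.
    # It fails for one kind of URL to show how errors surface.
    if url.endswith("/bad"):
        raise ValueError("no <h1> found")
    return "H1 for " + url


def collect(urls, max_workers=2):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns Future objects, which is what as_completed() expects.
        future_to_url = {executor.submit(fake_fetch, u): u for u in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                # result() re-raises any exception raised inside the worker.
                results[url] = future.result()
            except Exception:
                results[url] = "No Data Received!"
    return results


if __name__ == "__main__":
    print(collect(["http://example.com/a", "http://example.com/bad"]))
```

Keying the results by URL is a design choice for this sketch: as_completed() yields futures in completion order, not submission order, so a plain list would lose track of which result belongs to which URL.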
Upvotes: 1
Views: 307
Reputation: 11060
You are extracting the string from bsh, and then trying to access h1 from it, which will fail as the string doesn't have an h1 method/attribute.
bsh = bsh.h1.text.strip()
return bsh.h1.text.strip()
Instead just do:
return bsh.h1.text.strip()
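If some pages might have no h1 at all, soup.h1 is None and .text would raise an AttributeError, so it may be worth guarding for that too. Here is a rough sketch, assuming the parsing is split into its own helper; extract_h1 is a hypothetical name, and it uses the built-in "html.parser" rather than 'lxml' only so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_h1(html):
    # Hypothetical helper: return the stripped text of the first <h1>, or None.
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.h1  # None when the document has no <h1> tag
    if h1 is None:
        return None
    return h1.text.strip()


if __name__ == "__main__":
    print(extract_h1("<html><body><h1>  Prices </h1></body></html>"))
    print(extract_h1("<html><body><p>no heading</p></body></html>"))
```

The caller can then decide what to do with None (skip it, or append a placeholder) instead of relying on a broad try/except.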
Upvotes: 3