Lee Roy

Reputation: 297

BeautifulSoup: "TypeError: 'str' object is not callable" when using .h1.text()

EDIT: this is a concurrent.futures issue, not a BS4 issue. Concurrent futures was returning an empty list after retrieving data, which resulted in the NoneType error from BS4.

I'm attempting to scrape H1s from a list of URLs using Beautiful Soup, but getting the error TypeError: 'str' object is not callable on one of the URLs.

If I print the output, I can see I have retrieved 3 of the h1's before the error.

If I remove .h1.text.strip() I get a different error, although curiously it prints the html 4 times not 3.

e.g. I changed bsh_h1 = bsh.h1.text.strip() to bsh_h1 = bsh and it ran four times instead and produced the error: TypeError: unhashable type: 'ResultSet'

I've tried adding a try/except block, but it seems to produce errors elsewhere in the code, so I feel I'm missing something fundamental. Maybe BS4 is returning NoneType and I need to skip it?

Here's my code.

import concurrent.futures
from bs4 import BeautifulSoup
from urllib.request import urlopen

CONNECTIONS = 1

archive_url_list = [
    "https://web.archive.org/web/20171220015929/http://www.manueldrivingschool.co.uk:80/prices.php",
    "https://web.archive.org/web/20160313085709/http://www.manueldrivingschool.co.uk/lessons_prices.php",
    "https://web.archive.org/web/20171220002420/http://www.manueldrivingschool.co.uk:80/prices",
    "https://web.archive.org/web/20201202094502/https://www.manueldrivingschool.co.uk/success",
]

archive_h1_list = []
def get_archive_h1(h1_url):
    html = urlopen(h1_url)
    bsh = BeautifulSoup(html.read(), 'lxml')
    bsh = bsh.h1.text.strip()
    return bsh.h1.text.strip()

def concurrent_calls():
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        f1 = executor.map(get_archive_h1, archive_url_list)
        for future in concurrent.futures.as_completed(f1):
            try:
                data = future.result()
                archive_h1_list.append(data)
            except Exception:
                archive_h1_list.append("No Data Received!")
                pass

if __name__ == '__main__':
    concurrent_calls()
    print(archive_h1_list)

PS: I'm using concurrent.futures for multithreading. I initially thought that was the issue, but I'm leaning towards BS4 now.
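
Following up on the EDIT at the top, here is a minimal sketch of the concurrent.futures side. It assumes the fix is to hand as_completed real Future objects: executor.map yields the return values themselves, whereas executor.submit returns one Future per call. The helper name fetch_all and its parameters are hypothetical and not part of the original code.

```python
import concurrent.futures


def fetch_all(func, items, max_workers=1):
    """Run func over items in a thread pool; collect results or a placeholder."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns Future objects, which is what as_completed() expects.
        # executor.map() would yield the return values directly instead.
        futures = [executor.submit(func, item) for item in items]
        for future in concurrent.futures.as_completed(futures):
            try:
                # result() re-raises any exception raised inside func.
                results.append(future.result())
            except Exception:
                results.append("No Data Received!")
    return results
```

With the code above, this would be called as fetch_all(get_archive_h1, archive_url_list, CONNECTIONS). Results arrive in completion order, not in the order of the URL list.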

Upvotes: 1

Views: 307

Answers (1)

match

Reputation: 11060

You reassign bsh to the string extracted from the first h1, and then try to access h1 on that string, which fails because a str has no h1 attribute.

bsh = bsh.h1.text.strip()
return bsh.h1.text.strip()

Instead just do:

return bsh.h1.text.strip()
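
On the NoneType worry raised in the question, here is a minimal sketch that assumes some pages may have no h1 at all. Beautiful Soup's tag-name navigation (soup.h1) returns None when no such tag exists, so it can be checked before reading .text. The helper name h1_text_or_none is hypothetical.

```python
def h1_text_or_none(soup):
    """Return the stripped text of the first <h1>, or None if the page has none."""
    h1 = soup.h1  # None when the parsed document contains no <h1>
    if h1 is None:
        return None
    return h1.text.strip()
```

Inside get_archive_h1 this would replace the last two lines, for example return h1_text_or_none(BeautifulSoup(html.read(), 'lxml')), and the caller can then skip or label the None entries.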

Upvotes: 3
