anasvaf

Reputation: 108

BeautifulSoup failing to parse an HTML webpage - javascript error

I am trying to parse a web page using bs4 and lxml. In particular, I am trying to extract information from the Web of Science with the following code:

import requests
from bs4 import BeautifulSoup

def parse_all_authors(soup, author_name):
    pages_left = True
    articles = []  # article links collected from every result page
    while pages_left:
        articles.extend(soup.find_all('a', {"class": "smallV110"}))
        # follow the "Next Page" link until there is none left
        a = soup.find('a', {"class": "paginationNext", "title": "Next Page"})
        if a:
            link = a["href"]
            soup = BeautifulSoup(requests.get(link).text, "lxml")
        else:
            pages_left = False
    coauthors = {}  # last name -> author search URL

    for article in articles:
        link = article["href"]
        soup = BeautifulSoup(requests.get("https://apps.webofknowledge.com" + link).text, "lxml")
        add_coauthors = soup.find_all('a', {"title": "Find more records by this author"})
        for auth in add_coauthors:
            name = auth.text
            names = name.split(',')
            last_name = names[0].lower()
            url = auth["href"]
            if last_name not in coauthors:
                coauthors[last_name] = url
    return coauthors

To check whether the page is parsed correctly, I ran, for example:

soup = BeautifulSoup(requests.get("https://apps.webofknowledge.com/Search.do?product=WOS&SID=R1hBLiuXxLjnVr3iXNn&search_mode=GeneralSearch&prID=770f4d07-ccdf-4e30-a906-a98e4b6eb455").text, "lxml")

and the webpage is parsed correctly.

However, when I call my function parse_all_authors with the same "soup" variable and a string with the author I want to search for, I get the following error: requests.exceptions.InvalidURL: Failed to parse: apps.webofknowledge.comjavascript:;

I cannot find this string ("apps.webofknowledge.comjavascript:;") anywhere in the page source. I have also tried parsing the same page with html.parser and html5lib instead of "lxml", but I still get the same error.

Could you help me with that?

Upvotes: 0

Views: 853

Answers (1)

helb

Reputation: 3234

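For each article, the page source contains a link with href="javascript:;" and class="smallV110". Your soup.find_all('a', {"class": "smallV110"}) call matches these links as well, so they end up in articles. Later, "https://apps.webofknowledge.com" + "javascript:;" is passed to requests.get, which is where the string in the error message comes from.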
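A minimal sketch of how the failing string comes about, using only the standard library. The record path "/full_record.do?doc=1" is a made-up placeholder, not a real Web of Science URL.

```python
from urllib.parse import urlsplit

BASE = "https://apps.webofknowledge.com"  # host used in the question's code

# Two href values of the kind described above: the placeholder link and a
# record link (the query string here is hypothetical).
hrefs = ["javascript:;", "/full_record.do?doc=1"]

for href in hrefs:
    url = BASE + href  # same string concatenation as in parse_all_authors
    print(repr(href), "->", url)

bad_url = BASE + hrefs[0]
# With no "/" after the host, everything after "https://" is treated as the
# network location, which is the string shown in the InvalidURL message.
print(urlsplit(bad_url).netloc)
# The placeholder href on its own parses as a URL with scheme "javascript".
print(urlsplit(hrefs[0]).scheme)
```

This is also why the string does not appear in the page source: it only exists after the concatenation.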

You probably want to select just the actual links with href="/full_record.do?…".

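This should do it:

articles.extend(soup.find_all('a', {"class": "smallV110", "href": lambda href: href and href.startswith("/full_record.do")}))

(The "href and" guard is there because, as far as I know, the function is called with None for a matching tag that has no href attribute. Alternatively, use lambda href: href != "javascript:;" if that suits your needs better.)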
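The filter can also be written as a named function, which makes it easy to check on its own. This is only a sketch: is_record_link and record_url are hypothetical helper names, the example query string is made up, and the None check assumes BeautifulSoup passes None for a tag without an href.

```python
from urllib.parse import urljoin

BASE = "https://apps.webofknowledge.com"  # host used in the question's code


def is_record_link(href):
    """Return True only for hrefs that point at a full-record page."""
    # href is expected to be None when the tag has no href attribute
    return href is not None and href.startswith("/full_record.do")


def record_url(href):
    """Build the absolute URL for a site-relative record link."""
    return urljoin(BASE, href)


for candidate in ["/full_record.do?doc=1", "javascript:;", None]:
    print(repr(candidate), is_record_link(candidate))
```

Under those assumptions it would be passed as the href filter, i.e. soup.find_all('a', {"class": "smallV110", "href": is_record_link}), and record_url(article["href"]) would replace the string concatenation inside the loop.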

Upvotes: 1
