Beautifulsoup, grab text with link

Question

I'm making a web spider to automate some of my work. I have a table with lots of drivers and different versions for different operating systems. So far everything works fine, but I'm having a hard time separating the links for each operating system. I'll post part of the HTML here, but I can't post the whole page. The problem is that I don't know how to grab each link together with the text next to it. I can grab all of the links, but then I don't know which link is for which operating system.

This is the content of one cell in the table. All I need is to get each link along with its OS version (Win8.1, Win10, Win7).

SfP/StP
    AHWFW0609P_WinB.zip
    ^Win8.1

SfP/StP
    AHWFW0553P_WinT.zip^Win10

Here is the code I use to grab the names and the links.

import bs4

drivers = {}
rng_lst = [str(x) for x in range(5, 43)]

# Name the parser explicitly and close the file once it has been parsed
with open(r"Path to HTML file", 'rb') as file:
    soup = bs4.BeautifulSoup(file, 'html.parser')

table = soup.findAll('table')[0]
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) > 4:
        cell_num = cells[1].get_text(strip=True)
        # Only keep rows whose second cell is a number from 5 to 42
        if cell_num in rng_lst:
            drv_name = cells[2].get_text(strip=True)
            drivers[drv_name] = {'links': []}
            links = cells[4].findAll('a')
            for link in links:
                drivers[drv_name]['links'].append(link.get('href'))

nu11p01n73R · Accepted Answer

Assuming that string contains the HTML content:

from bs4 import BeautifulSoup

soup = BeautifulSoup(string, 'html.parser')

for pTag in soup.find_all('p'):
    # The link itself
    anchorTag = pTag.findNext('a')
    # The OS label sits in a span nested inside the MsoHyperlink span
    linkText = pTag.find('span', {'class': 'MsoHyperlink'}).span.text
    print("Link : ", anchorTag["href"])
    print("Text to Link ", linkText)
    print()

This would give you the output:

Link :  LINK_TO_FILE
Text to Link  Win8.1

Link :  LINK_TO_FILE
Text to Link   Win10

What does it do?

Looking at the input string, we can see that the anchors and the text we are interested in are inside p tags.

The text sits inside a span tag, which in turn is nested in another span tag that follows the anchor tag.

soup.find_all('p'): find_all returns a list of all the p tags.
pTag.findNext('a'): for each p tag, findNext finds the next occurrence of an anchor tag. This anchor tag contains the relevant link.
pTag.find('span', {'class': 'MsoHyperlink'}): find returns the first span within the current p tag whose class attribute is MsoHyperlink.
- .span returns the span nested inside the span returned by find
- .text returns the text of that nested span
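
As a rough sketch of the steps above, the pairing can be wrapped in a small function that returns (link, OS label) tuples instead of printing them. The markup in SAMPLE_HTML, the example.com URLs and the function name links_with_labels are all made up for illustration, since the real page's HTML is not shown in the question. The sketch also uses p_tag.find('a') rather than findNext, so that a paragraph without a link cannot pick up the anchor of a later paragraph, and it skips paragraphs that are missing any of the expected tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one table cell; the real HTML
# from the question is not available, so this structure is assumed.
SAMPLE_HTML = """
<div>
  <p>SfP/StP <a href="https://example.com/win81.zip">win81.zip</a>
     <span class="MsoHyperlink"><span>Win8.1</span></span></p>
  <p>SfP/StP <a href="https://example.com/win10.zip">win10.zip</a>
     <span class="MsoHyperlink"><span>Win10</span></span></p>
</div>
"""


def links_with_labels(html):
    """Return a list of (href, label) tuples, one per usable <p> tag."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for p_tag in soup.find_all("p"):
        # Search only inside this paragraph for the link
        anchor = p_tag.find("a")
        # Outer span carrying the MsoHyperlink class
        outer = p_tag.find("span", {"class": "MsoHyperlink"})
        # Skip paragraphs that do not have the assumed structure
        if anchor is None or outer is None or outer.span is None:
            continue
        label = outer.span.get_text(strip=True)
        pairs.append((anchor.get("href"), label))
    return pairs


pairs = links_with_labels(SAMPLE_HTML)
for href, label in pairs:
    print(label, "->", href)
```

In the question's loop, the same function could be fed str(cells[4]) and its result stored under drivers[drv_name], assuming each cell follows this structure.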
