Alex Zel

Reputation: 688

Beautifulsoup, grab text with link

I'm making a web spider to automate some of my work. I have a table with lots of drivers and different versions for different operating systems. So far everything works fine, but I'm having a hard time separating the links for each operating system. I'll post part of the HTML here, but I can't post the whole page. The problem is that I don't know how to grab each link together with the text next to it. I can grab all of the links, but then I don't know which link is for which operating system.

This is the content of one cell in the table. All I need is to get each link along with its OS version (Win8.1, Win10, Win7):

<p class="MsoNormal" style="mso-line-height-alt:9.3pt"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">SfP/StP
    <a href="LINK_TO_FILE">AHWFW0609P_WinB</a>.zip
    </span><span class="MsoHyperlink"><b><sup><span style="font-size:11.0pt;
    font-family:&quot;Calibri&quot;,sans-serif;color:#984806;background:white;text-decoration:
    none;text-underline:none">Win8.1</span></sup></b></span><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif"><o:p></o:p></span></p>

<p class="MsoNormal" style="mso-line-height-alt:9.3pt"><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">SfP/StP
    <a href="LINK_TO_FILE">AHWFW0553P_WinT</a></span><span style="color:#1F497D">.</span><span style="font-size:11.0pt;font-family:
    &quot;Calibri&quot;,sans-serif;color:#1F497D">zip</span><span class="MsoHyperlink"><b><sup><span style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#984806;
    background:white;text-decoration:none;text-underline:none"> Win10</span></sup></b></span><span class="MsoHyperlink"><b><sup><span style="color:#984806;background:white;
    text-decoration:none;text-underline:none"><o:p></o:p></span></sup></b></span></p>

Here is the code I use to grab the names and the links:

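import bs4

file = open(r"Path to HTML file", 'rb')
drivers = {}
# Row numbers of interest, as strings, to compare against the cell text.
rng_lst = [str(x) for x in range(5, 43)]

# Name the parser explicitly so the result does not depend on what is installed.
soup = bs4.BeautifulSoup(file, 'html.parser')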

table = soup.findAll('table')[0]
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) > 4:
        cell_num = cells[1].get_text(strip=True)
        if any(cell_num == n for n in rng_lst):
            drv_name = cells[2].get_text(strip=True)
            drivers[drv_name] = {'links': []}
            links = cells[4].findAll('a')
            for link in links:
                drivers[drv_name]['links'].append(link.get('href'))

Upvotes: 0

Views: 341

Answers (2)

user3078365

Reputation:

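os_cell = cells[4]
# First span in the cell that carries an OS label such as "Win8.1".
os_span = os_cell.find("span", class_="MsoHyperlink")
# Named os_name so it does not shadow the standard os module.
os_name = os_span.string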
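
The snippet above returns only the first OS label in the cell. Here is a rough sketch of how it might be extended to keep every link together with its label, assuming each link sits in its own p tag as in the question's markup. The function name links_by_os and the tiny sample table are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup


def links_by_os(cell):
    """Map OS label -> href for every <p> in a table cell (assumed layout)."""
    result = {}
    for paragraph in cell.find_all("p"):
        link = paragraph.find("a")
        label = paragraph.find("span", class_="MsoHyperlink")
        # Skip paragraphs that lack either the link or the label.
        if link is not None and label is not None:
            result[label.get_text(strip=True)] = link.get("href")
    return result


# Hypothetical one-cell table standing in for cells[4] in the question.
html = (
    '<table><tr><td>'
    '<p><a href="a.zip">A</a><span class="MsoHyperlink">Win8.1</span></p>'
    '<p><a href="b.zip">B</a><span class="MsoHyperlink"> Win10</span></p>'
    '</td></tr></table>'
)
cell = BeautifulSoup(html, "html.parser").find("td")
os_links = links_by_os(cell)
print(os_links)
```

In the question's loop, the returned dictionary could be stored under drivers[drv_name] in place of the plain list of links.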

Upvotes: 0

nu11p01n73R

Reputation: 26677

Assuming that string contains the HTML content:

from bs4 import BeautifulSoup

soup = BeautifulSoup(string, 'html.parser')

for pTag in soup.find_all('p'):
    # Next anchor after the start of this <p>; it holds the download link.
    anchorTag = pTag.findNext('a')
    # OS label: the inner span of the first span with class MsoHyperlink.
    linkText = pTag.find('span', {'class': 'MsoHyperlink'}).span.text
    print("Link : ", anchorTag["href"])
    print("Text to Link ", linkText)
    print()

This would give you the following output:

Link :  LINK_TO_FILE
Text to Link  Win8.1

Link :  LINK_TO_FILE
Text to Link   Win10

What does it do?

By looking at the input string, we can see that the anchors and the text we are interested in are inside the p tags.

The text sits in a span tag that is in turn nested inside another span tag, which follows the anchor tag.


  • soup.find_all('p') The find_all will return the list of p tags.

  • pTag.findNext('a') For each p tag, findNext finds the next occurrence of an anchor tag. This anchor tag contains the relevant link.

  • pTag.find('span', {'class' : 'MsoHyperlink' } ) The find will find the span within the current p tag with the attribute class set as MsoHyperlink

    • .span returns the span within the span returned by find

    • .text returns the text of the corresponding span
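
The steps above can be sketched as a small reusable function. This is only an illustration under a few assumptions: the name pair_links_with_os, the shortened sample markup, and the distinct LINK_TO_FILE_1 / LINK_TO_FILE_2 placeholders are not part of the original answer. It uses find('a') instead of findNext so the search stays inside each p tag, and get_text(strip=True) to drop the leading space seen in " Win10".

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question (inline styles removed).
SAMPLE_HTML = """
<p class="MsoNormal"><span>SfP/StP
    <a href="LINK_TO_FILE_1">AHWFW0609P_WinB</a>.zip
    </span><span class="MsoHyperlink"><b><sup><span>Win8.1</span></sup></b></span></p>
<p class="MsoNormal"><span>SfP/StP
    <a href="LINK_TO_FILE_2">AHWFW0553P_WinT</a></span><span>.</span><span>zip</span><span class="MsoHyperlink"><b><sup><span> Win10</span></sup></b></span></p>
"""


def pair_links_with_os(html):
    """Return a list of (href, os_label) tuples, one per <p> that has both."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for p_tag in soup.find_all("p"):
        # Search only inside this paragraph, not the rest of the document.
        anchor = p_tag.find("a", href=True)
        os_span = p_tag.find("span", class_="MsoHyperlink")
        if anchor is None or os_span is None:
            continue  # paragraph without a link or without an OS label
        pairs.append((anchor["href"], os_span.get_text(strip=True)))
    return pairs


pairs = pair_links_with_os(SAMPLE_HTML)
for href, os_label in pairs:
    print(href, os_label)
```

Skipping paragraphs that lack either tag avoids the AttributeError that chained attribute access raises when find returns None.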

Upvotes: 1
