Reputation: 688
I'm making a web spider to automate some of my work. I have a table with lots of drivers and different versions for different operating systems. So far everything works fine, but I'm having a hard time separating the links for each operating system. I'll post part of the HTML here, but I can't post the whole page. The problem is that I don't know how to grab each link together with the text next to it. I can grab all of the links, but then I don't know which link is for which operating system.
This is the content of one cell in the table. All I need is to get each link along with its OS version (Win8.1, Win10, Win7):
<p class="MsoNormal" style="mso-line-height-alt:9.3pt"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">SfP/StP
<a href="LINK_TO_FILE">AHWFW0609P_WinB</a>.zip
</span><span class="MsoHyperlink"><b><sup><span style="font-size:11.0pt;
font-family:"Calibri",sans-serif;color:#984806;background:white;text-decoration:
none;text-underline:none">Win8.1</span></sup></b></span><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"><o:p></o:p></span></p>
<p class="MsoNormal" style="mso-line-height-alt:9.3pt"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">SfP/StP
<a href="LINK_TO_FILE">AHWFW0553P_WinT</a></span><span style="color:#1F497D">.</span><span style="font-size:11.0pt;font-family:
"Calibri",sans-serif;color:#1F497D">zip</span><span class="MsoHyperlink"><b><sup><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#984806;
background:white;text-decoration:none;text-underline:none"> Win10</span></sup></b></span><span class="MsoHyperlink"><b><sup><span style="color:#984806;background:white;
text-decoration:none;text-underline:none"><o:p></o:p></span></sup></b></span></p>
Here is the code I use to grab the names and the links:
import bs4

file = open(r"Path to HTML file", 'rb')
drivers = {}
# row numbers of interest, as strings, to compare against the cell text
rng_lst = [str(x) for x in range(5, 43)]
soup = bs4.BeautifulSoup(file, "html.parser")
table = soup.findAll('table')[0]
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) > 4:
        cell_num = cells[1].get_text(strip=True)
        if cell_num in rng_lst:
            drv_name = cells[2].get_text(strip=True)
            drivers[drv_name] = {'links': []}
            # every download link in the fifth cell, regardless of OS
            links = cells[4].findAll('a')
            for link in links:
                drivers[drv_name]['links'].append(link.get('href'))
Upvotes: 0
Views: 341
Reputation:
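os_cell = cells[4]
# first span with class MsoHyperlink in the cell
os_span = os_cell.find("span", class_="MsoHyperlink")
# named os_name rather than os, so the standard os module is not shadowed
os_name = os_span.get_text(strip=True)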
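Note that find returns only the first matching span, so a cell with several paragraphs yields just one OS label. A minimal sketch of collecting every label with find_all follows. The markup in cell_html is a hypothetical, simplified stand-in for one cell of the question's table (the hrefs LINK_A and LINK_B are placeholders), and empty MsoHyperlink spans like the trailing one in the sample are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the cell shown in the question.
cell_html = """
<td>
<p><span>SfP/StP <a href="LINK_A">AHWFW0609P_WinB</a>.zip</span>
<span class="MsoHyperlink"><b><sup><span>Win8.1</span></sup></b></span></p>
<p><span>SfP/StP <a href="LINK_B">AHWFW0553P_WinT</a></span><span>.zip</span>
<span class="MsoHyperlink"><b><sup><span> Win10</span></sup></b></span>
<span class="MsoHyperlink"><b><sup><span></span></sup></b></span></p>
</td>
"""

cell = BeautifulSoup(cell_html, "html.parser")

os_labels = []
for span in cell.find_all("span", class_="MsoHyperlink"):
    label = span.get_text(strip=True)  # strips the leading space in " Win10"
    if label:                          # ignore spans that carry no text
        os_labels.append(label)

print(os_labels)
```

This gives the labels in document order, but it still does not say which link each label belongs to.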
Upvotes: 0
Reputation: 26677
Assuming that string contains the HTML content:
from bs4 import BeautifulSoup

soup = BeautifulSoup(string, "html.parser")
for pTag in soup.find_all('p'):
    # next <a> tag after the start of this <p>
    anchorTag = pTag.findNext('a')
    # text of the span nested inside the MsoHyperlink span
    linkText = pTag.find('span', {'class' : 'MsoHyperlink' } ).span.text.strip()
    print("Link :", anchorTag["href"])
    print("Text to Link", linkText)
    print()
This would give you the output:
Link : LINK_TO_FILE
Text to Link Win8.1
Link : LINK_TO_FILE
Text to Link Win10
What does it do?
By looking at the input string, we can see that the anchors and the text we are interested in are inside the p tags. The text sits within a span tag, which in turn is nested inside another span tag that comes after the anchor tag.
soup.find_all('p')
find_all returns the list of p tags.
pTag.findNext('a')
For each p tag, findNext finds the next occurrence of the anchor tag. This anchor tag contains the relevant link.
pTag.find('span', {'class' : 'MsoHyperlink' } )
find returns the first span within the current p tag whose class attribute is set to MsoHyperlink.
.span
returns the span nested within the span returned by find.
.text
returns the text of the corresponding span.
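One caveat with this approach: findNext keeps searching past the end of the current p tag, so a paragraph without a link would pick up the link from a later paragraph, and a paragraph without a MsoHyperlink span would raise an AttributeError. Below is a slightly more defensive sketch of the same idea, scoped to one cell. The helper name links_with_os and the markup in cell_html are hypothetical, made up here for illustration, and the hrefs are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the cell shown in the question.
cell_html = """
<td>
<p><span>SfP/StP <a href="LINK_A">AHWFW0609P_WinB</a>.zip</span>
<span class="MsoHyperlink"><b><sup><span>Win8.1</span></sup></b></span></p>
<p><span>No download available</span></p>
<p><span>SfP/StP <a href="LINK_B">AHWFW0553P_WinT</a></span><span>.zip</span>
<span class="MsoHyperlink"><b><sup><span> Win10</span></sup></b></span>
<span class="MsoHyperlink"><b><sup><span></span></sup></b></span></p>
</td>
"""


def links_with_os(cell):
    """Return one dict per <p> that contains a link: href, file name, OS label."""
    results = []
    for p in cell.find_all("p"):
        anchor = p.find("a")  # find stays inside this <p>, unlike findNext
        if anchor is None:
            continue  # paragraph without a download link
        os_label = None
        for span in p.find_all("span", class_="MsoHyperlink"):
            text = span.get_text(strip=True)
            if text:  # skip empty MsoHyperlink spans
                os_label = text
                break
        results.append({
            "href": anchor.get("href"),
            "name": anchor.get_text(strip=True),
            "os": os_label,  # None if the paragraph has no label
        })
    return results


cell = BeautifulSoup(cell_html, "html.parser")
for entry in links_with_os(cell):
    print(entry["os"], entry["href"])
```

In the question's loop, this could presumably replace the inner link loop with something like drivers[drv_name]['links'] = links_with_os(cells[4]), so each stored link carries its OS label.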
Upvotes: 1