Reputation: 37
I have the following code:
html_doc = """
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
all=soup.find_all("td",{"class":"normal alg"})
for item in all:
    a=str(item.find('a').contents[0])
    b=
How can I extract both a and b for every result, like this?
a = Link1.rar
b = https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2
I can extract either everything between the tags or only the URL, but not both.
Thank you.
Upvotes: 1
Views: 71
Reputation: 33384
Try the following code. Select all anchor tags, then get the text and the href value of each.
html_doc = """
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
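# Match every <a> whose title attribute starts with "Download".
links = soup.select("a[title^='Download']")
for item in links:
    a = item.text      # visible link text, e.g. Link1.rar
    b = item['href']   # target URL of the anchor
    print(a)
    print(b)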
Or use this stricter selector, which only matches such anchors inside a td with class normal:
html_doc = """
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
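# Same attribute-prefix match, restricted to anchors inside <td class="normal ...">.
links = soup.select("td.normal a[title^='Download']")
for item in links:
    a = item.text      # visible link text, e.g. Link1.rar
    b = item['href']   # target URL of the anchor
    print(a)
    print(b)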
Output:
Link1.rar
https://example.com/?283zh5uw21s47nefi4n2
Link2.rar
https://example.com/?9hqarjfyw1tpowop9wxc
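Note that the b shown in the question is the src of the img (the qr.pl URL), while the code above prints the href of the anchor. If the image URL is what you need, here is a minimal sketch that reads the link text, the href and the img src from the same cell. It assumes every td with classes normal and alg holds exactly one a and one img, as in the sample; extract_rows is just a hypothetical helper name, and the HTML is trimmed to the attributes that matter.

```python
from bs4 import BeautifulSoup

html_doc = """
<tr>
<td class="normal alg">
<img src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg">
<img src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""


def extract_rows(html):
    """Return (link text, href, img src) for each td.normal.alg cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # td.normal.alg matches cells carrying both classes, in any order.
    for cell in soup.select("td.normal.alg"):
        link = cell.find("a")
        img = cell.find("img")
        if link is None or img is None:
            continue  # skip cells that do not follow the assumed layout
        rows.append((link.get_text(strip=True), link.get("href"), img.get("src")))
    return rows


rows = extract_rows(html_doc)
for name, href, qr_src in rows:
    print("a =", name)
    print("b =", qr_src)  # use href here instead if you want the download link
```

Looping over the cells rather than over the anchors keeps each link paired with the image that sits next to it.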
Upvotes: 2