Pete

Reputation: 37

Extract URL and title using BeautifulSoup

I have the following code:

html_doc = """

<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>



"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

all=soup.find_all("td",{"class":"normal alg"})

for item in all:
    a=str(item.find('a').contents[0])
    b=


How can I extract a and b for all results, like this:

a= Link1.rar
b= https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2

I can extract either the text between the tags or only the URL, but not both.

Thank you.

Upvotes: 1

Views: 71

Answers (1)

KunduK

Reputation: 33384

Try the following code. Select all anchor tags whose title starts with "Download", then get the text and the href value of each.

html_doc = """

<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>

"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

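# Match every <a> whose title attribute starts with "Download".
# (Named "links" rather than "all" to avoid shadowing the built-in all().)
links = soup.select("a[title^='Download']")

for item in links:
    a = item.text       # link text, e.g. Link1.rar
    b = item['href']    # link target URL
    print(a)
    print(b)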

Or use this, which restricts the match to anchors inside td.normal cells:

html_doc = """

<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>

"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

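# Same attribute selector, limited to anchors inside <td class="normal ..."> cells.
# (Named "links" rather than "all" to avoid shadowing the built-in all().)
links = soup.select("td.normal a[title^='Download']")

for item in links:
    a = item.text       # link text, e.g. Link1.rar
    b = item['href']    # link target URL
    print(a)
    print(b)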

Output:

Link1.rar
https://example.com/?283zh5uw21s47nefi4n2
Link2.rar
https://example.com/?9hqarjfyw1tpowop9wxc
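
Note that the example value for b in the question is the src of the img tag (the qr.pl URL), whereas the code above prints the href of the anchor. If both URLs are needed, one option is to loop over the cells and read the anchor and the image from each. The following is only a sketch under some assumptions: the helper name extract_links and the dictionary keys are made up for illustration, the HTML is a trimmed copy of the first row from the question, and each cell is assumed to hold at most one a and one img.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first row from the question (style attributes omitted).
html_doc = """
<tr>
<td class="normal alg">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
"""


def extract_links(html):
    """Hypothetical helper: collect link text, href and img src per cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # td.normal.alg matches cells that carry both classes.
    for cell in soup.select("td.normal.alg"):
        link = cell.find("a")
        img = cell.find("img")
        if link is None:
            continue  # skip cells without a download link
        rows.append({
            "name": link.get_text(strip=True),
            "href": link.get("href"),
            # img may be missing, so fall back to None
            "img_src": img.get("src") if img is not None else None,
        })
    return rows


for row in extract_links(html_doc):
    print(row["name"])
    print(row["href"])
    print(row["img_src"])
```

Using .get() instead of item['href'] returns None when an attribute is absent rather than raising KeyError, which can help if some rows are incomplete.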

Upvotes: 2
