mouse

Reputation: 35

Web scraping in Python - how to capture all <a> elements

I'm using beautifulsoup4 to scrape data from the lyrics.com website, specifically this link: https://www.lyrics.com/album/1447935.

From this block, I'm trying to extract both <a> elements:

[<table class="tdata">
    <colgroup>
        <col style="width: 50px;"/>
        <col style="width: 430px;"/>
        <col style="width: 80px;"/>
        <col style="width: 80px;"/>
    </colgroup>
    <thead>
        <tr>
            <th>#</th>
            <th>Song</th>
            <th>Duration</th>
            <th> </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td class="tal qx">1</td>
            <td class="tal qx">
                <strong>
                    <a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a>
                </strong>
            </td>
            <td class="tal qx">3:32</td>
            <td class="tal vam rt"> </td>
        </tr>
        <tr>
            <td class="tal qx">2</td>
            <td class="tal qx">
                <strong>
                    <a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a>
                </strong>
            </td>
            <td class="tal qx">3:33</td>
            <td class="tal vam rt"> </td>
        </tr>
    </tbody>
</table>]

This is my code:

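import requests as r
from bs4 import BeautifulSoup as bs

album_url = "/album/1447935"  # assumed: path of the album page linked above
url = "http://www.lyrics.com" + album_url
page = r.get(url)
soup = bs(page.content, "html.parser")
songs = [a.get('href') for a in (table.find('a') for table in soup.findAll('table')) if a]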

However, it's only returning the first <a>:

['/lyric/15183453/Make+You+Feel+My+Love']

What could be wrong?

Edit: Thank you all for the answers! I upvoted, but I don't have enough rep for it to show.

Upvotes: 1

Views: 161

Answers (4)

Norsk

Reputation: 633

The other solutions work fine, but I prefer good old CSS selectors:

from bs4 import BeautifulSoup as bs
import requests as req
page = req.get('https://www.lyrics.com/album/1447935')
soup = bs(page.content, 'html.parser')
links = soup.select('table.tdata a[href]')
print(links)

This prints:

[<a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a>, <a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a>]

If you aren't familiar with selectors: this one matches every a element that has an href attribute and sits inside a table element with the class tdata. Note that it returns the a elements themselves, not the href values.
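To go from the matched elements to usable URLs, one option is to read each `href` and join it with the site root. A minimal sketch, assuming a trimmed copy of the table from the question (the `html` string and the `BASE_URL` name are illustrative, not part of the original code):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.lyrics.com"  # assumed site root, taken from the question's URL

# Trimmed copy of the table from the question (illustrative, not the live page).
html = """
<table class="tdata">
  <tbody>
    <tr><td><strong><a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a></strong></td></tr>
    <tr><td><strong><a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a></strong></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns Tag objects; index each one to read its href attribute.
hrefs = [a["href"] for a in soup.select("table.tdata a[href]")]

# The hrefs are site-relative, so resolve them against the site root.
absolute = [urljoin(BASE_URL, href) for href in hrefs]

print(hrefs)
print(absolute)
```

Because the selector already requires `[href]`, indexing with `a["href"]` cannot fail on an anchor that lacks the attribute.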

Upvotes: 1

teller.py3

Reputation: 844

This will work:

songs = [song['href'] for song in soup.select('table a')]

Output:

['/lyric/15183453/Make+You+Feel+My+Love', '/lyric/15183454/Painting+Pictures']
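One caveat worth sketching: `song['href']` raises `KeyError` if a table contains an `<a>` without an `href` attribute. The snippet below uses a made-up table (the `name="top"` anchor is hypothetical, added only to trigger the case) and shows that narrowing the selector to `a[href]` avoids it:

```python
from bs4 import BeautifulSoup

# Hypothetical table: the first anchor has no href attribute.
html = """
<table>
  <tr><td><a name="top">Top</a></td></tr>
  <tr><td><a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Indexing a tag by a missing attribute raises KeyError.
try:
    songs = [a["href"] for a in soup.select("table a")]
except KeyError:
    songs = None

# Requiring the attribute in the selector skips anchors without an href.
safe_songs = [a["href"] for a in soup.select("table a[href]")]

print(songs)
print(safe_songs)
```

Using `a.get("href")` instead of `a["href"]` is another way around this, at the cost of `None` entries in the result.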

Upvotes: 1

mouse

Reputation: 35

I was able to make it work with:

for a in soup.findAll('a'):
    if a.parent.name == 'strong':
        if a.parent.parent.name == 'td':
            print(a["href"])

Still not sure why the other method doesn't work, though, since I've used it elsewhere in my program with no issues.
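For what it's worth, the same parent checks can probably be written as a CSS child combinator, `td > strong > a`. A rough sketch on a trimmed copy of the table (the `html` string and the extra `/artists` link are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup: one link outside the table, two inside td > strong.
html = """
<a href="/artists">Artists</a>
<table class="tdata">
  <tbody>
    <tr><td><strong><a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a></strong></td></tr>
    <tr><td><strong><a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a></strong></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Parent-walking version, collecting into a list instead of printing.
by_parent = [
    a["href"]
    for a in soup.find_all("a")
    if a.parent.name == "strong" and a.parent.parent.name == "td"
]

# Selector version: ">" matches direct children only.
by_selector = [a["href"] for a in soup.select("td > strong > a[href]")]

print(by_parent)
print(by_selector)
```

Both lists contain only the two lyric links; the `/artists` link is skipped because its parent is not a `strong` inside a `td`.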

Upvotes: 1

BmoreAGG

Reputation: 160

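It looks like you want table.findAll instead of table.find. find returns only the first matching element (or None), while findAll returns a list of every match, so you then need to loop over each table's results.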
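A minimal sketch of the difference, run on a trimmed copy of the table from the question (the `html` string is an abbreviation for illustration, not the live page). It uses `find_all`, the current spelling of `findAll`:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (illustrative, not the live page).
html = """
<table class="tdata">
  <tbody>
    <tr><td><strong><a href="/lyric/15183453/Make+You+Feel+My+Love">Make You Feel My Love</a></strong></td></tr>
    <tr><td><strong><a href="/lyric/15183454/Painting+Pictures">Painting Pictures</a></strong></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The question's approach: find() stops at the first match,
# so each table contributes at most one link.
first_only = [a.get("href") for a in (t.find("a") for t in soup.find_all("table")) if a]

# find_all() returns every match, so iterate over the tables
# and then over each table's links.
all_links = [a.get("href") for t in soup.find_all("table") for a in t.find_all("a")]

print(first_only)
print(all_links)
```

Simply swapping `find` for `find_all` inside the original generator would not be enough, since `a` would then be a list of tags rather than a single tag; the nested loop in the comprehension handles that.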

Upvotes: 0
