Michael T

Reputation: 1955

Parse 'a' tags based on attribute using Python and BeautifulSoup

Using this bit of html:

    <td align="left">
     <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">
      Russell, Addison
     </a>
     SS OAK  - Won at $0
     <br>
      <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">
       Vargas, Jason
      </a>
      SP LAA
      <span title="Angels interested in bringing back Jason Vargas">
       <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update">
        <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
       </a>
      </span>
      - Dropped
     </br>
    </td>

I only want to show the links that do not have subtab="Update", but I haven't been able to figure out how to refer to the subtab attribute in a Python loop using BeautifulSoup. Here's what I attempted:

        soup = BeautifulSoup(html)
        pl = soup.findAll('a',{'class': 'playerLink'})
        for a in pl:
            if a.subtab == "Update":
                print "UPDATE"
            else:
                print "Player Name: " + a.text

I also tried filtering on the attribute in the findAll call:

        pl = soup.findAll('a',{'class': 'playerLink'}, {'subtype':0})

Neither approach works. The class is 'playerLink' in every case, so the subtab attribute is the only way I can tell the links apart. I'm very new to BeautifulSoup, so I'm not yet comfortable handling tags and attributes. The second attempt might work if I only wanted the tags with subtab="Update", but I want every a tag that has no subtab attribute.

Upvotes: 2

Views: 2244

Answers (5)

Mira Amalina

Reputation: 1

You can try this:

containers = page_soup.findAll("a", {"class": "playerLink"})
for container in containers:
    # container is already the <a> tag, so read its text directly
    url = "<a href='%s'>%s</a>" % (container.get("href"), container.get_text(strip=True))

Upvotes: 0

jfs

Reputation: 414235

a.attrs returns <a>'s attributes as a dictionary. You could check whether <a> tag has no subtab attribute using 'subtab' not in a.attrs:

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

player_links = SoupStrainer('a', 'playerLink')
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip()
         for a in soup.find_all(player_links) if 'subtab' not in a.attrs]
print(names)
# -> ['Russell, Addison', 'Vargas, Jason']

I can't find where it is mentioned in the documentation but it seems that specifying subtab=False also works to exclude any tag that has subtab attribute:

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

player_links = SoupStrainer('a', 'playerLink', subtab=False)
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip()
         for a in soup.find_all(player_links)]
print(names)

If found tags (player_links) are not nested then you could omit .find_all(player_links) call:

from bs4 import BeautifulSoup, SoupStrainer # pip install beautifulsoup4

player_links = SoupStrainer('a', 'playerLink', subtab=False)
soup = BeautifulSoup(html, parse_only=player_links)
names = [a.get_text().strip() for a in soup]
print(names)
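
For comparison, the same "class is playerLink, no subtab attribute" filter can probably also be written as a CSS selector. This is only a sketch: it assumes Beautiful Soup 4.7 or later (where select() is backed by the soupsieve package), uses the standard-library html.parser, and runs against a trimmed-down copy of the question's markup.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the markup from the question (assumed for illustration).
html = """
<td align="left">
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">Russell, Addison</a>
 SS OAK - Won at $0
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">Vargas, Jason</a>
 SP LAA
 <span title="Angels interested in bringing back Jason Vargas">
  <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update"><img border="0"/></a>
 </span>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# a.playerLink   -> <a> tags whose class list contains "playerLink"
# :not([subtab]) -> skip any tag that carries a subtab attribute
names = [a.get_text(strip=True) for a in soup.select("a.playerLink:not([subtab])")]
print(names)
```

The selector keeps the whole condition in one string, which may be easier to read than a separate attribute check inside the loop.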

Upvotes: 2

Birei

Reputation: 36262

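You can use the tag's get() method to check whether an element has an attribute (getattr() does not work here, because attribute access on a tag looks up a child tag by that name):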

from bs4 import BeautifulSoup
import sys

soup = BeautifulSoup(open(sys.argv[1], 'r'), 'html')

for a in soup.find_all('a', attrs={'class': 'playerLink'}):
    # getattr(a, 'subtab') would search for a child <subtab> tag, so use get() instead
    if a.get('subtab'): continue
    print(a.get_text("", strip=True))

Run it like:

python3 script.py htmlfile

It yields:

Russell, Addison
Vargas, Jason
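
The difference between the two lookups can be seen on a single tag. The snippet below is a small sketch with a made-up one-line document (the href and link text are placeholders). It also suggests why the `a.subtab == "Update"` test in the question never matches: dotted access looks for a child tag, not an HTML attribute.

```python
from bs4 import BeautifulSoup

# Made-up one-tag document; href and text are placeholders.
soup = BeautifulSoup(
    '<a class="playerLink" href="#" subtab="Update">note</a>', "html.parser"
)
a = soup.a

# Dotted access (and getattr) searches for a child tag named <subtab>; there is none.
child_lookup = a.subtab
# get() and has_attr() read the HTML attributes of the tag itself.
attr_value = a.get("subtab")
has_subtab = a.has_attr("subtab")

print(child_lookup, attr_value, has_subtab)
```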

Upvotes: 2

Michael T

Reputation: 1955

Messing around with the attrs attribute, I found out that this works:

if str(a.attrs).find('subtab') > 0:

It probably isn't the cleanest way to do it, but it works.
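
One caveat with the string search is that it looks at attribute values as well as attribute names. A small sketch with a hypothetical link (the example.com URL and player name are invented) shows the two checks disagreeing:

```python
from bs4 import BeautifulSoup

# Hypothetical link: no subtab attribute, but "subtab" appears inside the URL.
html = '<a class="playerLink" href="http://example.com/player?subtab=stats">Doe, John</a>'
a = BeautifulSoup(html, "html.parser").a

# String search over the whole attribute dictionary: matches keys and values alike.
text_match = str(a.attrs).find("subtab") > 0
# Dictionary membership: checks attribute names only.
key_match = "subtab" in a.attrs

print(text_match, key_match)
```

Testing `'subtab' in a.attrs` avoids that kind of false positive.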

Upvotes: 0

qmorgan

Reputation: 4894

A simple but not particularly elegant solution is simply to search for the string 'subtab' in each element:

for a in pl:
    # prettify() returns the tag's markup as a string, children included
    if 'subtab' in a.prettify():
        print("UPDATE")
    else:
        print("Player Name: " + a.text)
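
If searching the rendered markup feels too loose (it would also match the word inside link text or a URL), one alternative is to pass a function to find_all. The following is a sketch under assumptions: it uses Beautiful Soup 4 with html.parser, a shortened stand-in for the question's markup, and a helper name, is_plain_player_link, invented for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (paths are placeholders).
html = """
<a class="playerLink" href="/p/1">Russell, Addison</a>
<a class="playerLink" href="/p/2">Vargas, Jason</a>
<a class="playerLink" href="/p/2" subtab="Update"><img src="note.gif"/></a>
"""
soup = BeautifulSoup(html, "html.parser")


def is_plain_player_link(tag):
    """Return True for <a class="playerLink"> tags without a subtab attribute."""
    return (
        tag.name == "a"
        and "playerLink" in tag.get("class", [])
        and not tag.has_attr("subtab")
    )


# find_all calls the function on every tag and keeps those it accepts.
plain_links = soup.find_all(is_plain_player_link)
for a in plain_links:
    print("Player Name: " + a.get_text(strip=True))
```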

Upvotes: 1
