Michael T

Reputation: 1965

Parse HTML with Python and BeautifulSoup - get text both inside and outside the <a> tags

I have HTML with a number of <a> tags, and then text which is outside those tags. The text I'm trying to get is in <br> tags, except the first instance, which I guess is just part of the <td> tag. But if I try to get the text of the <td> tag (like td.text or something like that), it also gives me all the text in all the <a> and <br> tags.

    <td align="left">
     <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
      Garcia, Leury
     </a>
     SS CHW - Traded from Royal Disappointments
     <br>
      <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
       Almonte, Abraham
      </a>
      OF SEA - Traded from Royal Disappointments
      <br>
       <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
        Pillar, Kevin
       </a>
       OF TOR - Traded from Royal Disappointments
       <br>
        <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
         Sierra, Moises
        </a>
        LF TOR - Traded from Royal Disappointments
        <br>
         <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
          Paulino, Felipe
         </a>
         SP KC
         <span title="Felipe Paulino off 60-day DL">
          <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
           <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
          </a>
         </span>
         - Traded from Royal Disappointments
        </br>
       </br>
      </br>
     </br>
    </td>

Basically I want (as separate values) each text in an a tag, followed by each text outside the a tag. So the end result would be:

Garcia, Leury

SS CHW - Traded from Royal Disappointments

Almonte, Abraham

OF SEA - Traded from Royal Disappointments

Pillar, Kevin

OF TOR - Traded from Royal Disappointments

Sierra, Moises

LF TOR - Traded from Royal Disappointments

Paulino, Felipe

SP KC - Traded from Royal Disappointments

So far I only have the code for the text from the a tags:

    # psoup is the BeautifulSoup object for the page
    pl = psoup.find_all('a', {'class': 'playerLink'})
    for a in pl:
        print(a.text)

I really have no idea how to approach the rest of it.

Upvotes: 2

Views: 3366

Answers (2)

Balthazar Rouberol

Reputation: 7180

You can use the Tag.next property (which aliases Tag.next_element):

    for a in psoup('a', {'class': 'playerLink'}):
        print(a.text)
        print(a.next.next)

This works because each "outside" text is the second element after a link tag: the first element is the link's own text, and the one after it is the text that follows the closing </a>.
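
The second-element rule holds as long as the detail text directly follows the link text. In the Paulino row a <span> splits the detail text into two strings, and the icon link inside it also has class playerLink. The following is only a sketch of one way around that, assuming Python 3, bs4 with the built-in html.parser, and a shortened copy of the question's markup. The helper name pair_players is made up for illustration. It walks every string in the cell in document order and starts a new record whenever a string sits inside a playerLink anchor:

```python
from bs4 import BeautifulSoup

# Shortened sample based on the markup in the question (assumed input).
html = """
<td align="left">
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
  Garcia, Leury
 </a>
 SS CHW - Traded from Royal Disappointments
 <br>
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
  Paulino, Felipe
 </a>
 SP KC
 <span title="Felipe Paulino off 60-day DL">
  <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
   <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
  </a>
 </span>
 - Traded from Royal Disappointments
</td>
"""


def pair_players(cell):
    """Return (name, details) tuples for one table cell (hypothetical helper)."""
    records = []
    # find_all(string=True) returns every text node under the cell, in document order.
    for node in cell.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only strings between tags
        if node.find_parent('a', class_='playerLink') is not None:
            # Text inside a player link starts a new record.
            records.append([text, []])
        elif records:
            # Any other text belongs to the most recent player.
            records[-1][1].append(text)
    return [(name, ' '.join(parts)) for name, parts in records]


cell = BeautifulSoup(html, 'html.parser').td
pairs = pair_players(cell)
for name, details in pairs:
    print(name)
    print(details)
```

Because it looks at strings in document order rather than at siblings, it should not matter whether the parser nests the <br> tags (as in the prettified markup above) or treats them as empty elements.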

Upvotes: 2

user2926055

Reputation: 1991

What about just calling get_text on psoup?

(Pdb) print soup.get_text()


      Garcia, Leury

     SS CHW - Traded from Royal Disappointments


       Almonte, Abraham

      OF SEA - Traded from Royal Disappointments


        Pillar, Kevin

       OF TOR - Traded from Royal Disappointments


         Sierra, Moises

        LF TOR - Traded from Royal Disappointments


          Paulino, Felipe

         SP KC





         - Traded from Royal Disappointments
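
The raw output keeps the source indentation and blank lines. If separate, trimmed values are wanted, a small sketch (assuming Python 3 and bs4 with the built-in html.parser, on a one-player sample cut down from the question) could use stripped_strings, or get_text with a separator and strip=True:

```python
from bs4 import BeautifulSoup

# One-player sample cut down from the question's markup (assumed input).
html = ('<td><a class="playerLink">\n Garcia, Leury\n</a>\n'
        ' SS CHW - Traded from Royal Disappointments\n<br></td>')
soup = BeautifulSoup(html, 'html.parser')

# Each non-empty string, with surrounding whitespace removed.
values = list(soup.stripped_strings)
print(values)

# The same strings joined into one newline-separated block of text.
text = soup.get_text('\n', strip=True)
print(text)
```

Note that text split by an inner tag, like the Paulino line above, still comes back as two separate strings with this approach.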

Upvotes: 2
