Reputation: 1965
I have HTML with a number of <br> tags, and then text which is outside those tags. The text I'm trying to get is in <br> tags except the first instance, which I guess is just part of the <td> tag. But if I try to get the text of the <td> tag (with td.text or something like that), it also gives me all the text in all the <a> and <br> tags.
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
Pillar, Kevin
</a>
OF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
Sierra, Moises
</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
</br>
</br>
</br>
</br>
</td>
Basically I want (as separate values) each text in an a tag, followed by each text outside the a tag. So the end result would be:
Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC - Traded from Royal Disappointments
So far I only have the code for the text from the a tags:
pl = psoup.findAll('a', {'class': 'playerLink'})
for a in pl:
    print(a.text)
I really have no idea how to approach the rest of it.
Upvotes: 2
Views: 3366
Reputation: 7180
You can use the Tag.next property (which aliases Tag.next_element):
for a in psoup('a', {'class': 'playerLink'}):
    print(a.text)
    print(a.next.next)
This works because each "outside" text is the second element after a link; the first element is the link's own text.
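One caveat with the sample above: the Paulino entry has a nested playerLink (the one with subtab="Update") and its trailing text is split around a <span>, so a fixed number of .next hops will not return "SP KC - Traded from Royal Disappointments" in one piece. A minimal sketch of one way to handle that, assuming bs4 4.4 or later and the built-in html.parser: walk every string in document order and start a new row whenever a string sits inside a playerLink. The helper name player_rows and the trimmed SAMPLE markup are illustrative, not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (two of the five players).
SAMPLE = """<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
<br>
</td>"""


def player_rows(soup):
    """Return (name, details) pairs in document order (hypothetical helper)."""
    rows = []
    for node in soup.find_all(string=True):
        text = " ".join(node.split())  # collapse newlines and runs of spaces
        if not text:
            continue  # skip whitespace-only strings between tags
        if node.find_parent("a", class_="playerLink") is not None:
            rows.append([text, ""])  # link text starts a new player row
        elif rows:
            # any other text belongs to the most recent player
            rows[-1][1] = (rows[-1][1] + " " + text).strip()
    return [tuple(row) for row in rows]


rows = player_rows(BeautifulSoup(SAMPLE, "html.parser"))
for name, details in rows:
    print(name)
    print(details)
```

Because it only relies on document order, this should not depend on whether the parser nests the text inside the <br> tags or leaves it as a sibling of them.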
Upvotes: 2
Reputation: 1991
What about just calling get_text on psoup?
(Pdb) print soup.get_text()
Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC
- Traded from Royal Disappointments
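If the blank lines and stray whitespace get in the way, get_text also accepts a separator and a strip flag, and stripped_strings yields the same pieces as a list. A small sketch on the Paulino entry only, assuming bs4 with the built-in html.parser (SAMPLE is a trimmed copy of the question's markup):

```python
from bs4 import BeautifulSoup

# Only the last player from the question, where the text is split by a <span>.
SAMPLE = """<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
<br>
</td>"""

td = BeautifulSoup(SAMPLE, "html.parser")

# Each non-empty string, stripped of surrounding whitespace, in document order.
lines = list(td.stripped_strings)

# The same pieces joined into one string, one per line, with no blank lines.
text = td.get_text("\n", strip=True)
print(text)
```

Note that this still returns "SP KC" and "- Traded from Royal Disappointments" as separate pieces, and it does not mark which lines came from links, so names and details have to be paired up afterwards.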
Upvotes: 2