Reputation: 398
I know how to go through and find all the links, but I want the text immediately after a link.
For example, in the given html:
<p><a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+Armey++Richard+K.))+00028))">Rep Armey, Richard K.</a> [TX-26]
- 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+Davis++Thomas+M.))+00274))">Rep Davis, Thomas M.</a> [VA-11]
- 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+DeLay++Tom))+00282))">Rep DeLay, Tom</a> [TX-22]
- 11/9/1999
... (this repeats a number of times)
I want to extract the [CA-28] - 11/9/1999
that is associated with <a href=...>Rep Dreier, David</a>, and do this for all of the links in the list.
Upvotes: 1
Views: 1504
Reputation: 20984
findNextSibling is a robust and flexible way to do it.
The Setup
Use this to set up.
from BeautifulSoup import BeautifulSoup
from pprint import pprint
markup = '''
<p><a href="/cgi-bin/...00028))">Rep Armey, Richard K.</a> [TX-26]
- 11/9/1999
<br/><a href="/cgi-bin/...00274))">Rep Davis, Thomas M.</a> [VA-11]
- 11/9/1999
<br/><a href="/cgi-bin/...00282))">Rep DeLay, Tom</a> [TX-22]
- 11/9/1999
'''
soup = BeautifulSoup(markup)
Note that the hrefs are truncated for clarity; the result is the same on the original sample.
Find all the links
Call findAll with 'a':
links = soup.findAll('a')
pprint(links)
pprint shows the markup of each link.
[<a href="/cgi-bin/...00028))">Rep Armey, Richard K.</a>,
<a href="/cgi-bin/...00274))">Rep Davis, Thomas M.</a>,
<a href="/cgi-bin/...00282))">Rep DeLay, Tom</a>]
Get the text following an element
Call findNextSibling with text=True.
text_0 = links[0].findNextSibling(text=True)
pprint(text_0)
pprint shows the text following the first link, with newlines encoded as \n.
u' [TX-26]\n - 11/9/1999\n'
Do it for all links
Use findNextSibling in a list comprehension to get the text following each link.
next_text = [ln.findNextSibling(text=True) for ln in links]
pprint(next_text)
pprint shows a list of the text, one item per link in the markup.
[u' [TX-26]\n - 11/9/1999\n',
u' [VA-11]\n - 11/9/1999\n',
u' [TX-22]\n - 11/9/1999\n ']
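For what it's worth, the same approach carries over to the newer bs4 package, where the camelCase BS3 names above become find_all and find_next_sibling. A minimal sketch (assuming bs4 is installed; the hrefs are truncated as in the setup):

```python
# Sketch of the same technique with bs4 (BeautifulSoup 4).
# find_next_sibling(string=True) returns the next sibling that is a
# bare text node (a NavigableString), i.e. the text after each link.
from bs4 import BeautifulSoup

markup = '''
<p><a href="/cgi-bin/...00028))">Rep Armey, Richard K.</a> [TX-26]
 - 11/9/1999
<br/><a href="/cgi-bin/...00274))">Rep Davis, Thomas M.</a> [VA-11]
 - 11/9/1999
'''
soup = BeautifulSoup(markup, 'html.parser')
links = soup.find_all('a')
next_text = [a.find_next_sibling(string=True) for a in links]
```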
Upvotes: 4
Reputation: 353499
There may be a prettier way, but I usually chain .next:
>>> soup.find_all("a")
[<a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+Armey++Richard+K.))+00028))">Rep Armey, Richard K.</a>, <a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+Davis++Thomas+M.))+00274))">Rep Davis, Thomas M.</a>, <a href="/cgi-bin/bdquery/?&Db=d106&querybd=@FIELD(FLD004+@4((@1(Rep+DeLay++Tom))+00282))">Rep DeLay, Tom</a>]
>>> [a.next for a in soup.find_all("a")]
[u'Rep Armey, Richard K.', u'Rep Davis, Thomas M.', u'Rep DeLay, Tom']
>>> [a.next.next for a in soup.find_all("a")]
[u' [TX-26]\n - 11/9/1999\n', u' [VA-11]\n - 11/9/1999\n', u' [TX-22]\n - 11/9/1999']
>>> {a.next: a.next.next for a in soup.find_all("a")}
{u'Rep Davis, Thomas M.': u' [VA-11]\n - 11/9/1999\n', u'Rep DeLay, Tom': u' [TX-22]\n - 11/9/1999', u'Rep Armey, Richard K.': u' [TX-26]\n - 11/9/1999\n'}
etc.
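If you then want the district and date as separate fields, a quick regex over those trailing strings would do it. A sketch; the pattern assumes the " [XX-NN] - date" shape from the sample:

```python
import re

# Hypothetical follow-up: split each trailing string into district and date.
# Assumes the " [TX-26]\n - 11/9/1999" shape seen in the sample output.
trailing = [u' [TX-26]\n - 11/9/1999\n',
            u' [VA-11]\n - 11/9/1999\n',
            u' [TX-22]\n - 11/9/1999']
pattern = re.compile(r'\[([A-Z]{2}-\d+)\]\s*-\s*([\d/]+)')
parsed = [pattern.search(s).groups() for s in trailing]
# parsed -> [('TX-26', '11/9/1999'), ('VA-11', '11/9/1999'), ('TX-22', '11/9/1999')]
```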
Upvotes: 5