andy mcevoy

Reputation: 398

How would I get the text AFTER a link in Python with BeautifulSoup?

I know how to go through and find all the links, but I want the text immediately after a link.

For example, in the given html:

<p><a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+Armey++Richard+K.))+00028))">Rep Armey, Richard K.</a> [TX-26]
 - 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+Davis++Thomas+M.))+00274))">Rep Davis, Thomas M.</a> [VA-11]
 - 11/9/1999
<br/><a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+DeLay++Tom))+00282))">Rep DeLay, Tom</a> [TX-22]
 - 11/9/1999

... (this repeats a number of times)

I want to extract the "[CA-28] - 11/9/1999" that is associated with <a href=...>Rep Dreier, David</a>, and do this for all of the links in the list.

Upvotes: 1

Views: 1504

Answers (2)

Iain Samuel McLean Elder

Reputation: 20984

findNextSibling is a robust and flexible way to do it.

The Setup

Use this to set up.

from BeautifulSoup import BeautifulSoup
from pprint import pprint

markup = '''
<p><a href="/cgi-bin/...00028))">Rep Armey, Richard K.</a> [TX-26]
 - 11/9/1999
<br/><a href="/cgi-bin/...00274))">Rep Davis, Thomas M.</a> [VA-11]
 - 11/9/1999
<br/><a href="/cgi-bin/...00282))">Rep DeLay, Tom</a> [TX-22]
 - 11/9/1999
 '''

soup = BeautifulSoup(markup)

What we do here:

  • Import BeautifulSoup to slurp the soup
  • Import pprint to inspect intermediate results with pretty-printing
  • Paste the sample markup (with hrefs truncated) into a variable
  • Slurp the markup so we can shred it

The hrefs are truncated for clarity. The result is the same on the original sample.
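(The code above targets BeautifulSoup 3, where the class lives in the top-level BeautifulSoup package. If you are on BeautifulSoup 4 instead, a minimal equivalent setup, assuming the bs4 package is installed, would be:)

from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(markup, 'html.parser')  # naming a parser avoids the "no parser specified" warning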

Find all the links

Call findAll with 'a':

links = soup.findAll('a')
pprint(links)

pprint shows the markup of each link.

[<a href="/cgi-bin/...00028))">Rep Armey, Richard K.</a>,
 <a href="/cgi-bin/...00274))">Rep Davis, Thomas M.</a>,
 <a href="/cgi-bin/...00282))">Rep DeLay, Tom</a>]

Get the text following an element

Call findNextSibling with text=True.

text_0 = links[0].findNextSibling(text=True)
pprint(text_0)

pprint shows the text following the first link, newlines encoded as \n.

u' [TX-26]\n - 11/9/1999\n'

Do it for all links

Use findNextSibling in a list comprehension to get the text following each link.

next_text = [ln.findNextSibling(text=True) for ln in links]
pprint(next_text)

pprint shows a list of the text, one item per link in the markup.

[u' [TX-26]\n - 11/9/1999\n',
 u' [VA-11]\n - 11/9/1999\n',
 u' [TX-22]\n - 11/9/1999\n ']
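For completeness, the same flow under BeautifulSoup 4 uses the renamed methods find_all and find_next_sibling (a sketch, assuming the bs4 package and the markup variable from the setup above; string= is BS4's name for the older text= argument):

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, 'html.parser')
links = soup.find_all('a')
# the text node after each link, keyed by the link's own text
next_text = {a.get_text(): a.find_next_sibling(string=True) for a in links}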

Upvotes: 4

DSM

Reputation: 353499

There may be a prettier way, but I usually chain .next:

>>> soup.find_all("a")
[<a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+Armey++Richard+K.))+00028))">Rep Armey, Richard K.</a>, <a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+Davis++Thomas+M.))+00274))">Rep Davis, Thomas M.</a>, <a href="/cgi-bin/bdquery/?&amp;Db=d106&amp;querybd=@FIELD(FLD004+@4((@1(Rep+DeLay++Tom))+00282))">Rep DeLay, Tom</a>]
>>> [a.next for a in soup.find_all("a")]
[u'Rep Armey, Richard K.', u'Rep Davis, Thomas M.', u'Rep DeLay, Tom']
>>> [a.next.next for a in soup.find_all("a")]
[u' [TX-26]\n - 11/9/1999\n', u' [VA-11]\n - 11/9/1999\n', u' [TX-22]\n - 11/9/1999']
>>> {a.next: a.next.next for a in soup.find_all("a")}
{u'Rep Davis, Thomas M.': u' [VA-11]\n - 11/9/1999\n', u'Rep DeLay, Tom': u' [TX-22]\n - 11/9/1999', u'Rep Armey, Richard K.': u' [TX-26]\n - 11/9/1999\n'}

etc.
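If you also want the bracketed district and the date as separate fields, a small regex over that trailing text would do it (a sketch only; the pattern and variable names are mine, and soup is the parsed document from above):

import re

# e.g. u' [TX-26]\n - 11/9/1999\n' -> ('TX-26', '11/9/1999')
pattern = re.compile(r'\[([^\]]+)\]\s*-\s*(\S+)')

rows = []
for a in soup.find_all("a"):
    m = pattern.search(a.next.next)
    if m:
        district, date = m.groups()
        rows.append((a.next, district, date))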

Upvotes: 5
