How to extract link text, the text after each link, and the text after a br tag with Python

Question

I have passed the following string to BeautifulSoup to extract data from it, but I can't get some of the data out. After trying different methods, I managed to get the text inside the tags, the links, and the text that follows each link.


 
  
   
    
     
GOVERNOR:

Robert 
          Bentley (R)*
- Ex-Morgan County Commissioner & State Correctional Officer

Stacy George 
          (R)
- Ex-Morgan County Commissioner & State Correctional Officer

Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate

Kevin Bass (D)
- Businessman & Ex-Pro Baseball Player

Parker Griffith 
          (D)
- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican

Here is my implementation with BeautifulSoup:

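from bs4 import BeautifulSoup

# Above_String holds the HTML shown above
soup = BeautifulSoup(Above_String, "html.parser")

# Earlier attempt using the br tags:
# for br in soup.find_all("br"):
#     print(br)
#     # print(br.nextSibling.content)

for link in soup.find_all("a"):
    if link.string is None:
        # .string is None when the link does not contain exactly one string,
        # so read the text from its <strong> child instead
        print(link.strong.string, link.get("href"), link.next_sibling)
    else:
        print(link.string, link.get("href"), link.next_sibling, link.next_sibling)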

The above code prints out something like this:

> Robert 
                Bentley (R)*
      http://governor.alabama.gov/ 

>      Stacy George 
                (R)
      http://www.facebook.com/stacy.george.3139 
     - Ex-Morgan County Commissioner & State Correctional Officer

>      Kevin Bass (D)
      http://www.bassforbama.com/ 
     - Businessman & Ex-Pro Baseball Player


>      Parker Griffith 
                (D)
      http://www.parkergriffithforcongress.com/ 
     - Ex-Congressman, Ex-State Sen., Physician & Ex-Republican

The third item is missing from the output:

Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate

How do I get around this using BeautifulSoup? I tried find_all("br"), but it doesn't work because the br tags give me NoneType.

Martijn Pieters · Accepted Answer

Grab all the text nodes that follow each link, stopping at the next a or strong tag:

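from itertools import takewhile
from bs4 import NavigableString

# True for any sibling that is not an <a> or <strong> tag; the None default
# covers text nodes that do not expose a tag name
not_link = lambda t: getattr(t, 'name', None) not in ('a', 'strong')

for link in soup.find_all("a"):
    print('Link contents:')
    text = link.text.strip()
    # Collect everything after the link, up to the next <a> or <strong>
    for sibling in takewhile(not_link, link.next_siblings):
        if isinstance(sibling, NavigableString):
            text += str(sibling).strip()
        else:
            text += sibling.text.strip()
    print(text)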

This prints:

Link contents:
Robert 
                Bentley (R)*- Ex-Morgan County Commissioner & State Correctional Officer
Link contents:
Stacy George 
                (R)- Ex-Morgan County Commissioner & State Correctional OfficerBob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
Link contents:
Kevin Bass (D)- Businessman & Ex-Pro Baseball Player
Link contents:
Parker Griffith 
                (D)- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
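
Note that in the output above the unlinked Bob Starkey line is folded into the Stacy George entry, because it sits between that link and the next one. If one record per line is wanted instead, a possible variation is to split the container's children at each br tag. The following is only a minimal sketch: the original HTML is not shown in the question, so SAMPLE_HTML is an assumed layout (entries as direct children of one container, separated by br tags), and split_records is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the question's real HTML is not shown, so this layout is assumed.
SAMPLE_HTML = """<div>
<a href="http://governor.alabama.gov/"><strong>Robert Bentley (R)*</strong></a>
- Ex-Morgan County Commissioner &amp; State Correctional Officer<br>
<a href="http://www.facebook.com/stacy.george.3139">Stacy George (R)</a>
- Ex-Morgan County Commissioner &amp; State Correctional Officer<br>
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate<br>
<a href="http://www.bassforbama.com/">Kevin Bass (D)</a>
- Businessman &amp; Ex-Pro Baseball Player
</div>"""


def split_records(container):
    """Return one (text, href) pair per <br>-separated line in container."""
    records, parts, href = [], [], None
    # A trailing None acts as a sentinel so the last line is flushed too.
    for node in list(container.children) + [None]:
        if node is None or getattr(node, "name", None) == "br":
            text = " ".join(" ".join(parts).split())  # collapse whitespace
            if text:
                records.append((text, href))
            parts, href = [], None
        elif isinstance(node, NavigableString):
            parts.append(str(node))
        else:
            parts.append(node.get_text())
            # Remember the link target, whether the tag is the link or wraps it
            link = node if node.name == "a" else node.find("a")
            if link is not None:
                href = link.get("href")
    return records


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = split_records(soup.div)
for text, href in records:
    print(href, "|", text)
```

With this assumed layout, the Bob Starkey line comes back as its own record with href set to None. If the real page nests the entries differently (for example inside table cells or font tags), the container passed to split_records would need to be chosen accordingly.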
