web scraping using beautifulsoup: separating values

Question

I am scraping a web page with BeautifulSoup. The page has the following source (each entry is the text of an a tag):


                            Courtney, John  (Dem)                       ,

                            Clinton, Hilary  (Dem)                      ,

                            Lee, Kevin  (Rep)                       ,

The following code works:

for item in soup.find_all("a"):
    print item

But the code returns the following:

Courtney, John  (Dem)
Clinton, Hilary  (Dem)
Lee, Kevin  (Rep)

Can I collect just the names, and then the affiliation information separately? Thanks in advance.

gtlambert · Accepted Answer

You could use:

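from bs4 import BeautifulSoup

# Sample markup; the bare <a> tags are an assumption based on the question's find_all("a").
content = '''
<a>Courtney, John  (Dem)</a>,
<a>Clinton, Hilary  (Dem)</a>,
<a>Lee, Kevin  (Rep)</a>
'''

politicians = []
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(content, 'html.parser')
for item in soup.find_all('a'):
    # Split once on the last '(' -> ['Courtney, John  ', 'Dem)']
    name, party = item.text.strip().rsplit('(', 1)
    # Trim whitespace and drop the trailing ')'
    politicians.append((name.strip(), party.strip()[:-1]))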

Because the name and the affiliation together make up the text content of each a tag, you can't collect them separately. You have to collect them as a single string and then split it. I used the strip() method to remove unwanted whitespace, the rsplit() method to split the text at the left bracket, and the slice [:-1] to drop the trailing right bracket from the affiliation.

Output

print(politicians)
[(u'Courtney, John', u'Dem'),
 (u'Clinton, Hilary', u'Dem'),
 (u'Lee, Kevin', u'Rep')]
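
If you want the names and the affiliations in two separate lists, and want to tolerate entries that have no "(...)" part, something like the following sketch may help. It is written for Python 3 and uses only the standard library. The helper split_name_party and the link_texts list are hypothetical stand-ins (they are not part of the answer above); in real use the strings would come from item.text for each a tag.

```python
import re

# Hypothetical helper (not part of the answer above): split "Name  (Party)" text.
# The pattern captures everything before the final "(...)" and the text inside it.
_PATTERN = re.compile(r"^(.*?)\s*\(([^()]*)\)$")


def split_name_party(text):
    """Return (name, party); party is None when no trailing "(...)" is found."""
    cleaned = text.strip()
    match = _PATTERN.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group(1), match.group(2).strip()


# Stand-ins for the strings that item.text would yield for each <a> tag.
link_texts = [
    "  Courtney, John  (Dem)  ",
    "Clinton, Hilary  (Dem)",
    "Lee, Kevin  (Rep)",
]

pairs = [split_name_party(text) for text in link_texts]

# Separate the pairs into two parallel lists.
names = [name for name, _ in pairs]
parties = [party for _, party in pairs]

print(names)
print(parties)
```

Returning None for a missing affiliation, rather than raising an error, is a design choice for this sketch; the rsplit approach above fails with a ValueError on such entries, which may be preferable if every link is expected to carry an affiliation.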
