kevin

Reputation: 2014

web scraping using beautifulsoup: separating values

I am scraping a web page with BeautifulSoup. The page has the following source:

<a href="/en/Members/">
                            Courtney, John  (Dem)                       </a>,
<a href="/en/Members/">
                            Clinton, Hilary  (Dem)                      </a>,
<a href="/en/Members/">
                            Lee, Kevin  (Rep)                       </a>,

The following code works:

for item in soup.find_all("a"):
    print item.get_text().strip()

But it returns the following:

Courtney, John  (Dem)
Clinton, Hilary  (Dem)
Lee, Kevin  (Rep)

Can I collect only the names, and then the affiliation information separately? Thanks in advance.

Upvotes: 1

Views: 98

Answers (2)

Joe Young

Reputation: 5885

You can use re.split() to split a string on multiple delimiters by writing a regular expression that matches each of them. Here I split on ( and ).

import re

for item in soup.find_all("a"):
    # re.split() needs a string, so pass the tag's text, not the Tag itself
    tokens = re.split(r'\(|\)', item.get_text())
    name = tokens[0].strip()
    affiliation = tokens[1].strip()
    print name
    print affiliation

Source: https://docs.python.org/2/library/re.html#re.split

re.split() will return a list that looks like this:

>>> re.split(r'\(|\)', item.get_text().strip())
['Courtney, John  ', 'Dem', '']

Grab entry 0 from the list for the name, stripping off white space from the ends. Grab entry 1 for the affiliation, doing the same.
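
As a self-contained illustration of the split described above, here is a minimal sketch that runs without fetching any page. The link_texts list and the split_name_affiliation helper are made up for this example; in the real script each string would come from item.get_text(). It is written so that it also runs on Python 3.

```python
import re

# Sample link texts modelled on the question's HTML (hypothetical stand-ins
# for item.get_text() on each <a> tag, including the surrounding whitespace).
link_texts = [
    "\n        Courtney, John  (Dem)        ",
    "\n        Clinton, Hilary  (Dem)        ",
    "\n        Lee, Kevin  (Rep)        ",
]

def split_name_affiliation(text):
    """Split 'Name  (Party)' on the brackets and trim both pieces."""
    tokens = re.split(r"\(|\)", text)
    return tokens[0].strip(), tokens[1].strip()

pairs = [split_name_affiliation(text) for text in link_texts]

for name, affiliation in pairs:
    print(name)
    print(affiliation)
```

This assumes every link text contains exactly one bracketed affiliation; a text with no brackets would make tokens[1] raise an IndexError.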

Upvotes: 1

gtlambert

Reputation: 11971

You could use:

from bs4 import BeautifulSoup

content = '''
<a href="/en/Members/">Courtney, John  (Dem)</a>
<a href="/en/Members/">Clinton, Hilary  (Dem)</a>,
<a href="/en/Members/">Lee, Kevin  (Rep)</a>
'''

politicians = []
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly to avoid a warning
for item in soup.find_all('a'):
    name, party = item.text.strip().rsplit('(')
    politicians.append((name.strip(), party.strip()[:-1])) 

Because the names and the affiliation information both make up the text content of the <a> tags, you can't collect them separately. You have to collect them together as a string, then separate them. I have used the strip() function to remove unwanted whitespace, the rsplit('(') function to split the text content on the left bracket, and the [:-1] slice to drop the trailing right bracket from the party.

Output

print(politicians)
[(u'Courtney, John', u'Dem'),
 (u'Clinton, Hilary', u'Dem'),
 (u'Lee, Kevin', u'Rep')]
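
If some links might not carry a bracketed party at all, the tuple unpacking above would fail on them. One possible way to guard against that, sketched here with a hypothetical helper named parse_member (not part of the original answer), is to use str.rpartition, which always returns three pieces:

```python
def parse_member(text):
    """Return (name, party); party is None when no '(' is present."""
    text = text.strip()
    # rpartition splits on the last '(' only. When there is no '(',
    # the separator comes back as an empty string.
    name, sep, party = text.rpartition("(")
    if not sep:
        return text, None
    # Drop the closing bracket and any stray whitespace around the party.
    return name.strip(), party.rstrip(")").strip()

print(parse_member("  Lee, Kevin  (Rep)  "))
print(parse_member("Smith, Jane"))
```

In the loop above you would call it as parse_member(item.text). "Smith, Jane" is an invented entry used only to show the fallback case.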

Upvotes: 1
