Reputation: 2014
I do web scraping using beautifulsoup. The web page has the following source:
<a href="/en/Members/">
Courtney, John (Dem) </a>,
<a href="/en/Members/">
Clinton, Hilary (Dem) </a>,
<a href="/en/Members/">
Lee, Kevin (Rep) </a>,
The following code works:
for item in soup.find_all("a"):
    print(item)
However, it returns the following:
Courtney, John (Dem)
Clinton, Hilary (Dem)
Lee, Kevin (Rep)
Can I collect just the names, and then the affiliation information separately? Thanks in advance.
Upvotes: 1
Views: 98
Reputation: 5885
You can use re.split() to split a string on multiple delimiters by crafting a regular expression pattern to split on. Here I split on ( and ):
import re
for item in soup.find_all("a"):
    # re.split() needs a string, so pass the tag's text, not the Tag object
    tokens = re.split(r'\(|\)', item.get_text())
    name = tokens[0].strip()
    affiliation = tokens[1].strip()
    print(name)
    print(affiliation)
Source: https://docs.python.org/2/library/re.html#re.split
re.split() will return a list that looks like this:
>>> re.split(r'\(|\)', item.get_text().strip())
['Courtney, John ', 'Dem', '']
Grab entry 0 from the list for the name, stripping off whitespace from the ends. Grab entry 1 for the affiliation, doing the same.
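To see the splitting step on its own, here is a minimal sketch that runs without BeautifulSoup. The sample strings stand in for the text of each a tag, and the helper name split_on_parens is only illustrative, not something from the original code:

```python
import re

# Sample strings standing in for the text of each <a> tag (assumed input).
samples = ["Courtney, John (Dem) ", "Clinton, Hilary (Dem) ", "Lee, Kevin (Rep) "]


def split_on_parens(text):
    """Split 'Name (Party)' into (name, party) by splitting on '(' and ')'."""
    # The character class [()] matches either bracket, same as \(|\)
    tokens = re.split(r"[()]", text)
    # tokens[0] is the name, tokens[1] the party; both may carry whitespace
    return tokens[0].strip(), tokens[1].strip()


results = [split_on_parens(s) for s in samples]
print(results)
```

Note that this assumes every string contains a bracketed affiliation; a string without brackets would raise an IndexError on tokens[1].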
Upvotes: 1
Reputation: 11971
You could use:
from bs4 import BeautifulSoup
content = '''
<a href="/en/Members/">Courtney, John (Dem)</a>
<a href="/en/Members/">Clinton, Hilary (Dem)</a>,
<a href="/en/Members/">Lee, Kevin (Rep)</a>
'''
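politicians = []
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(content, 'html.parser')
for item in soup.find_all('a'):
    # Split "Name (Party)" at the left bracket
    name, party = item.text.strip().rsplit('(')
    # [:-1] drops the trailing right bracket from the party
    politicians.append((name.strip(), party.strip()[:-1]))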
Because the names and the affiliation information both make up the text content of the a tags, you can't collect them separately. You have to collect them together as a string, then separate them. I have used the strip() function to remove unwanted whitespace, the rsplit('(') function to split the text content on the occurrence of the left bracket, and the [:-1] slice to drop the trailing right bracket from the affiliation.
Output
print(politicians)
[(u'Courtney, John', u'Dem'),
(u'Clinton, Hilary', u'Dem'),
(u'Lee, Kevin', u'Rep')]
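If a name could itself contain a bracket, a slightly more defensive variant of the same idea might look like the sketch below. The helper name parse_member and the sample strings are assumptions for illustration. It limits rsplit() to one split, so only the last left bracket separates name from affiliation:

```python
def parse_member(text):
    """Return (name, party) from a string like 'Lee, Kevin (Rep)'."""
    # maxsplit=1 splits only at the last '(' so earlier brackets stay in the name
    name, party = text.strip().rsplit("(", 1)
    # rstrip(")") removes the closing bracket without relying on its position
    return name.strip(), party.rstrip(")").strip()


members = [parse_member(s) for s in ["  Courtney, John (Dem) \n", "Lee, Kevin (Rep)"]]
print(members)
```

As with the original loop, a string with no left bracket raises a ValueError when unpacking, so you may want to guard against that if the page is not uniform.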
Upvotes: 1