Reputation: 101
I have the following problem: when there is a space between HTML tags, my code does not output the text I expect.
Instead of outputting:
year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
I get this instead:
|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
The text "year" is missing from the output.
Here's my code:
from BeautifulSoup import BeautifulSoup
import re

html = '<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td><td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td></tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table></html>'
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')
store = []
for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        try:
            row.append(''.join(td.find(text=True)))
        except Exception:
            row.append('')
    store.append('|'.join(filter(None, row)))
print '\n'.join(store)
The problem comes from the space in:
"<td> <p>year</p></td>"
Is there a way to get rid of that space when I pull up some html from the web?
Upvotes: 1
Views: 2602
Reputation: 24939
As @Herman suggested, you should use Tag.text to find the relevant text for the tag you're currently parsing.
A bit more detail on why Tag.find() didn't do what you want: BeautifulSoup's Tag.find() is very similar to Tag.findAll(). In fact, the implementation of Tag.find() just invokes Tag.findAll() with the keyword argument limit set to 1. Tag.findAll() then recursively descends the tag tree and returns as soon as it finds some text that satisfies the text argument. Since you set text to True, the whitespace string u' ' technically satisfies this condition and is therefore what Tag.find() returns.
You can see that year is returned as the second match if you print out td.findAll(text=True, limit=2). You can also set text to a regular expression that requires a non-whitespace character, so that whitespace-only strings are skipped: td.find(text=re.compile(r'\S')).
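To make the "first text node" behaviour concrete, here is a minimal sketch in Python 3 that uses only the standard library's html.parser as a stand-in for BeautifulSoup. The class name TextNodeCollector and the variable names are made up for this illustration; it only mimics the idea of walking the text nodes in document order, not BeautifulSoup's actual implementation.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: records every text node in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once per run of text between tags, including whitespace.
        self.nodes.append(data)


collector = TextNodeCollector()
collector.feed('<td> <p>year</p></td>')
collector.close()

text_nodes = collector.nodes

# Taking the first text node unconditionally yields the lone space.
first_any = text_nodes[0]

# Requiring at least one non-whitespace character skips it.
non_blank = re.compile(r'\S')
first_non_blank = next(n for n in text_nodes if non_blank.search(n))

print(repr(first_any), repr(first_non_blank))
```

The space between the opening td tag and the p tag is a text node in its own right, which is why a "give me the first text" query stops there.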
I also noticed that you're using store.append('|'.join(filter(None, row))). I think you should use the csv module instead, particularly csv.writer. It handles the problems you would otherwise face if a pipe appears somewhere in your parsed HTML, and it makes your code much cleaner.
Here's an example:
import csv
import re
from cStringIO import StringIO
from BeautifulSoup import BeautifulSoup

html = ('<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
        '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td>'
        '</tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table>'
        '</html>')
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')
output = StringIO()
writer = csv.writer(output, delimiter='|')
for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        row.append(td.text)
    writer.writerow(filter(None, row))
print output.getvalue()
And the output is:
year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
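As a small illustration of why csv.writer is safer than a plain '|'.join, the following Python 3 sketch writes a row in which one cell itself contains a pipe. The cell value 'total|net' is invented for the example and does not come from the table above.

```python
import csv
from io import StringIO

# Hypothetical row: the first cell contains the delimiter character.
row = ['total|net', '100,000', '50,000']

# A plain join cannot be split back into the original three cells.
naive = '|'.join(row)

# csv.writer quotes the cell that contains the delimiter.
buffer = StringIO()
csv.writer(buffer, delimiter='|').writerow(row)
written = buffer.getvalue()

# Reading the line back with the same delimiter recovers the row.
parsed = next(csv.reader(StringIO(written), delimiter='|'))

print(naive)
print(written, end='')
```

With the default quoting settings, only fields that need it are quoted, so rows without embedded pipes look the same as the joined version.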
Upvotes: 1
Reputation: 48445
Instead of row.append(''.join(td.find(text=True))), use:
row.append(''.join(td.text))
Output:
year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
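For comparison, the same "join all the text inside each cell" idea can be sketched in Python 3 with only the standard library's html.parser. This is a stand-in for illustration, not BeautifulSoup; the class name CellTextParser is made up, and stripping each text piece before joining is an assumption chosen to reproduce the output shown above.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Hypothetical helper: collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the open <td>, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            self._cell = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # Join every text piece found anywhere inside the cell.
            self.rows[-1].append(''.join(self._cell))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            # Assumption: strip each piece so the stray space disappears.
            self._cell.append(data.strip())


html = ('<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
        '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td>'
        '</tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table>'
        '</html>')

parser = CellTextParser()
parser.feed(html)
parser.close()

lines = ['|'.join(row) for row in parser.rows]
print('\n'.join(lines))
```

Because every text piece in a cell is used rather than only the first one, the leading space no longer hides "year".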
Upvotes: 5