Josh Lee

Reputation: 101

How can I remove spaces in between HTML tags using BeautifulSoup in Python?

I have the following problem: when there is a space between HTML tags, my code does not output the text I expect.

Instead of outputting:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000

I get this instead:

 |salary|bonus
2005|100,000|50,000
2006|120,000|80,000

The text "year" is missing from the output.

Here's my code:

from BeautifulSoup import BeautifulSoup
import re


html = '<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td><td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td></tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table></html>'
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

store=[]

for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        try:
            row.append(''.join(td.find(text=True)))
        except Exception:
            row.append('')
    store.append('|'.join(filter(None, row)))
print '\n'.join(store)

The problem comes from the space in:

"<td> <p>year</p></td>"

Is there a way to get rid of that space when I pull HTML from the web?

Upvotes: 1

Views: 2602

Answers (3)

crax

Reputation: 556

Use:

html = re.sub(r'\s\s+', '', html)
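Note that \s\s+ only matches runs of two or more whitespace characters, so the single space in "<td> <p>" is left untouched. A minimal sketch of a variant that targets whitespace sitting directly between tags follows. The helper name strip_inter_tag_whitespace is made up for illustration, and the pattern assumes the markup has no <pre> blocks or inline elements where that whitespace is significant.

```python
import re


def strip_inter_tag_whitespace(html):
    """Remove whitespace that sits directly between a '>' and the next '<'.

    Hypothetical helper for illustration; it is a plain regex substitution,
    not an HTML-aware transformation.
    """
    return re.sub(r'>\s+<', '><', html)


html = '<tr><td> <p>year</p></td><td><p>salary</p></td></tr>'
cleaned = strip_inter_tag_whitespace(html)
print(cleaned)
```

Text inside a tag, such as "100,000", is not affected because the pattern only matches whitespace that has a tag boundary on both sides.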

Upvotes: 1

Mahmoud Abdelkader

Reputation: 24939

As @Herman suggested, you should use Tag.text to find the relevant text for the tag you're currently parsing.

A bit more detail on why Tag.find() didn't do what you want: BeautifulSoup's Tag.find() is very similar to Tag.findAll(). In fact, Tag.find() simply calls Tag.findAll() with the keyword argument limit set to 1. Tag.findAll() then recursively descends the tag tree and returns as soon as it finds text that satisfies the text argument. Since you set text to True, any text node matches, including the whitespace-only string u' ' that sits before the <p> tag, so that is what Tag.find() returns.

You can see that year is found as well if you print td.findAll(text=True, limit=2). You can also pass a regular expression as text to skip whitespace-only strings, for example td.find(text=re.compile(r'\S')).
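The behaviour described above can be reproduced with a short sketch. It assumes the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import used in the question, so the method is find_all() and the keyword is string instead of text; the underlying idea is the same.

```python
import re

from bs4 import BeautifulSoup  # assumption: bs4, not BeautifulSoup 3

soup = BeautifulSoup('<table><tr><td> <p>year</p></td></tr></table>',
                     'html.parser')
td = soup.find('td')

# string=True matches any text node, so the first hit is the lone space.
first = td.find(string=True)

# Asking for two matches shows that "year" is the next text node.
first_two = td.find_all(string=True, limit=2)

# A regex requiring one non-whitespace character skips the blank node.
non_blank = td.find(string=re.compile(r'\S'))

print(repr(str(first)))
print([str(s) for s in first_two])
print(str(non_blank))
```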

I also noticed that you're using store.append('|'.join(filter(None, row))). I think you should use the csv module instead, particularly csv.writer. It handles the problems you might face if a pipe appears somewhere in your parsed HTML, and it makes your code much cleaner.

Here's an example:

import csv
import re
from cStringIO import StringIO

from BeautifulSoup import BeautifulSoup


html = ('<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
        '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td>'
        '</tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table>'
        '</html>')
soup = BeautifulSoup(html)
table = soup.find('table')
rows = table.findAll('tr')

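# Collect the rows in an in-memory buffer; csv.writer quotes any cell
# that happens to contain the delimiter.
output = StringIO()
writer = csv.writer(output, delimiter='|')

for tr in rows:
    cols = tr.findAll('td')
    row = []
    for td in cols:
        # td.text returns the combined text of everything inside the cell
        row.append(td.text)

    # filter(None, ...) drops cells whose text is empty
    writer.writerow(filter(None, row))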

print output.getvalue()

And the output is:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
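To illustrate the point about pipes inside parsed text, here is a small sketch on Python 3 (so io.StringIO replaces cStringIO). The sample rows are invented for the demonstration; the second row contains a pipe inside a cell.

```python
import csv
import io

# Invented sample data: one cell contains the delimiter itself.
rows = [['year', 'note'], ['2005', 'base|bonus']]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|')
writer.writerows(rows)
output = buffer.getvalue()
print(output)

# Reading the text back with the same delimiter recovers the original cells.
parsed = list(csv.reader(io.StringIO(output), delimiter='|'))
print(parsed)
```

With the default quoting mode, csv.writer wraps the cell containing the pipe in double quotes, which is what lets csv.reader split the line correctly. A plain '|'.join() would produce three columns for that row.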

Upvotes: 1

Herman Schaaf

Reputation: 48445

Instead of row.append(''.join(td.find(text=True))), use:

row.append(td.text)

Output:

year|salary|bonus
2005|100,000|50,000
2006|120,000|80,000
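For readers on the newer bs4 package, a rough equivalent is sketched below. This is an assumption-laden port, not the code from the question: it uses Python 3, the 'html.parser' backend, and get_text(strip=True), because in bs4 a bare td.text keeps the leading space and would return ' year'.

```python
from bs4 import BeautifulSoup  # assumption: bs4 on Python 3

html = ('<html><body><table><tr><td> <p>year</p></td><td><p>salary</p></td>'
        '<td>bonus</td></tr><tr><td>2005</td><td>100,000</td><td>50,000</td>'
        '</tr><tr><td>2006</td><td>120,000</td><td>80,000</td></tr></table>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

lines = []
for tr in soup.find('table').find_all('tr'):
    # get_text(strip=True) strips each text node and drops the empty ones,
    # so the stray space before <p>year</p> disappears.
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    lines.append('|'.join(cells))

result = '\n'.join(lines)
print(result)
```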

Upvotes: 5
