How to prevent BeautifulSoup from stripping lines

Question

I'm trying to convert an online HTML page into plain text.

I have a problem with this structure:

Available in  

French and 

English.

Here is its representation as a Python string:

'Available in French; English. '

When using:

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.text

BeautifulSoup renders it (in the 'txt' variable) as:

u'Available inFrenchandEnglish.'

It probably strips the whitespace from each line of the original HTML string, so the words run together.
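
To illustrate what I suspect is happening, here is a minimal sketch. It assumes the parser yields text nodes like the hand-written ones below (they are made up for illustration, not taken from the real page) and strips each one before joining:

```python
# Hypothetical text nodes for the div above, written by hand for illustration.
nodes = ["\n  Available in\n  ", "French", "\n  and\n  ", "English", ".\n"]

# Stripping each node before joining removes the whitespace between words.
stripped = "".join(node.strip() for node in nodes)

# Joining the raw nodes keeps the whitespace, which can be normalized afterwards.
raw = "".join(nodes)

print(repr(stripped))
print(repr(raw))
```

The first value has the words glued together, like the output above; the second still contains the line breaks and indentation between them.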

Is there a clean solution to this problem?

Thanks.

Oli · Accepted Answer

I finally got a good solution:

import re

def clean_line(line):
    # Drop line breaks, then collapse runs of spaces into a single space.
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
''.join([clean_line(line) for line in para.findAll(text=True)])

Which outputs:

u'Available in French and English.  '
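
For newer code, a similar result may be possible without walking the text nodes by hand. The following is only a sketch, assuming BeautifulSoup 4 (the bs4 package), whose get_text() does not strip individual strings unless asked to. The markup is a hypothetical stand-in for the div in the question, and html_to_text is an illustrative name, not part of the library:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the div in the question.
html_content = (
    "<div>\n"
    "  Available in\n"
    "  <a href='#'>French</a>\n"
    "  and\n"
    "  <a href='#'>English</a>.\n"
    "</div>"
)


def html_to_text(html):
    """Return the text of an HTML fragment with whitespace normalized."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates the text nodes without stripping them,
    # so the line breaks and indentation between words are still present.
    raw = soup.get_text()
    # Collapse every run of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", raw).strip()


print(html_to_text(html_content))
```

Unlike clean_line, this replaces line breaks with a space instead of deleting them, so words separated only by a newline stay separated.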
