Reputation: 15940
I'm trying to convert an online HTML page into plain text.
I have a problem with this structure:
<div align="justify"><b>Available in
<a href="http://www.example.com.be/book.php?number=1">
French</a> and
<a href="http://www.example.com.be/book.php?number=5">
English</a>.
</div>
Here is its representation as a Python string:
'<div align="justify"><b>Available in \r\n<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a> and \r\n<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n</div>'
When using:
html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.text
BeautifulSoup renders it (in the 'txt' variable) as:
u'Available inFrenchandEnglish.'
It apparently strips each line of the original HTML string, so the words run together.
Is there a clean way to solve this problem?
Thanks.
Upvotes: 1
Views: 311
Reputation: 15940
I finally found a good solution:
import re
def clean_line(line):
    # Drop CR/LF characters, then collapse runs of two or more spaces into one.
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))
html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
''.join([clean_line(line) for line in para.findAll(text=True)])
Which outputs:
u'Available in French and English. '
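A variation on the same idea, as a minimal sketch: concatenate the text nodes unchanged and collapse every run of whitespace in one pass, instead of cleaning each node separately. To keep it runnable without BeautifulSoup, the sketch uses the standard library's html.parser as a stand-in; TextCollector and html_to_text are hypothetical names, not part of any library. With BeautifulSoup, the same ' '.join(...split()) step could presumably be applied to the string returned by para.text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.chunks.append(data)


def html_to_text(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Join the nodes with no separator, then collapse each run of
    # whitespace (spaces, \r, \n) into a single space and trim the ends.
    return ' '.join(''.join(collector.chunks).split())


html_content = (
    '<div align="justify"><b>Available in \r\n'
    '<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a> and \r\n'
    '<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n</div>'
)
print(html_to_text(html_content))
```

Because no separator is inserted between nodes, the only spaces that survive are the ones already present in the markup, so nothing appears before the final dot.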
Upvotes: 2
Reputation: 15940
I got a solution:
html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.getText(separator=' ')
However, it is not ideal, because it inserts a space between every pair of adjacent text nodes:
u'Available in French and English . '
Notice the space before the dot.
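To see where that stray space comes from, here is a small sketch that works on the text nodes directly. The nodes list is written out by hand from the div in the question (an assumption about how the parser splits it), and collapse is a hypothetical helper used only to make the two results easy to compare.

```python
import re

# Text nodes of the div, in document order (hand-written assumption).
nodes = ['Available in \r\n', '\r\nFrench', ' and \r\n', '\r\nEnglish', '.\r\n']


def collapse(text):
    # Replace each run of whitespace with one space and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


# A separator is inserted at every node boundary, including the one
# between 'English' and '.', which had no whitespace in the markup.
with_separator = collapse(' '.join(nodes))

# Without a separator, only whitespace already in the markup remains.
without_separator = collapse(''.join(nodes))

print(repr(with_separator))
print(repr(without_separator))
```

The boundary between the link text and the dot is the only one with no whitespace on either side in the source, which is why it is the only place where the separator changes the result.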
Upvotes: 1