Reputation: 1378
I am trying to convert an html block to text using Python.
Input:
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
Desired output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
I tried the html2text module without much success:
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class' : 'body'})
print(html2text.html2text(txt))
The txt object produces the HTML block above. I'd like to convert it to text and print it on the screen.
Upvotes: 88
Views: 206187
Reputation: 326
from lxml import html as html_module

def html_2_text(html_content):
    tree = html_module.fromstring(html_content)
    # text_list = tree.xpath('//text()')
    # text_list = tree.xpath('//text()[not(ancestor::script)]')
    text_list = tree.xpath('//text()[not(ancestor::script) and normalize-space()]')
    text_list = [text.strip() for text in text_list]
    return "\n".join(text for text in text_list if text != "")
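A quick usage sketch, assuming the markup from the question is stored in a variable named html:
# `html` is assumed to hold the <div class="body"> markup from the question
print(html_2_text(html))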
Upvotes: 0
Reputation: 31801
A two-step lxml-based approach: sanitize the markup first, then convert it to plain text.
The script accepts either a path to an HTML file or piped stdin.
It removes script blocks and other potentially undesired content; you can configure the lxml Cleaner instance to suit your needs.
#!/usr/bin/env python3
import sys
from pathlib import Path

from lxml import html
from lxml.html import tostring
from lxml.html.clean import Cleaner


def sanitize(dirty_html):
    cleaner = Cleaner(page_structure=True,
                      meta=True,
                      embedded=True,
                      links=True,
                      style=True,
                      processing_instructions=True,
                      inline_style=True,
                      scripts=True,
                      javascript=True,
                      comments=True,
                      frames=True,
                      forms=True,
                      annoying_tags=True,
                      remove_unknown_tags=True,
                      safe_attrs_only=True,
                      safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                      remove_tags=('span', 'font', 'div')
                      )
    return cleaner.clean_html(dirty_html)


if __name__ == "__main__":
    # read from the file given as the first argument, or from piped stdin
    if len(sys.argv) > 1:
        source = Path(sys.argv[1]).read_text(encoding='utf-8')
    else:
        source = sys.stdin.read()
    source = sanitize(source)
    source = source.replace('<br>', '\n')
    tree = html.fromstring(source)
    plain = tostring(tree, method='text', encoding=str)
    print(plain)
Upvotes: 0
Reputation: 13
An updated version of Andreas' answer (further down this page).
from bs4 import BeautifulSoup


def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.get_text().strip()
        elif e.name in ['span']:
            text += ' '
        elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th', 'div']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text
Why? Some XML markup was still leaking through, spans were stripped without leaving enough spacing, and divs sometimes need extra line breaks. Everything else is the same.
Upvotes: 1
Reputation: 496
I don't know who wrote this library, but bless his/her heart.
Upvotes: 1
Reputation: 71
There is a library called inscriptis. It's really simple and lightweight, and it can get its input from a file or directly from a URL:
from inscriptis import get_text
text = get_text(html)
print(text)
The output is:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
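To feed it a page fetched directly from a URL, a minimal sketch using only the standard library (the URL is illustrative):
import urllib.request
from inscriptis import get_text

# illustrative URL; substitute the page you actually want to convert
html = urllib.request.urlopen('http://example.com/page.html').read().decode('utf-8')
print(get_text(html))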
Upvotes: 6
Reputation: 91
I encountered the same problem using Scrapy. You may try adding this to settings.py:
#settings.py
FEED_EXPORT_ENCODING = 'utf-8'
Upvotes: 0
Reputation: 1
from html.parser import HTMLParser


class HTMLFilter(HTMLParser):
    text = ''

    def handle_data(self, data):
        self.text += f'{data}\n'


def html2text(html):
    filter = HTMLFilter()
    filter.feed(html)
    return filter.text


content = html2text(content_temp)  # `content_temp` holds the HTML string to convert
Upvotes: -1
Reputation: 1204
I personally like the gazpacho solution by emehex, but it only uses regular expressions to filter out tags, with no further magic. This means that solution keeps the text inside <style> and <script>.
So I would rather implement a simple solution based on regular expressions and use the standard library (Python 3.4+) to unescape HTML entities:
import re
from html import unescape


def html_to_text(html):
    # use non-greedy matching to remove scripts and styles
    text = re.sub("<script.*?</script>", "", html, flags=re.DOTALL)
    text = re.sub("<style.*?</style>", "", text, flags=re.DOTALL)
    # remove other tags
    text = re.sub("<[^>]+>", " ", text)
    # collapse whitespace
    text = " ".join(text.split())
    # unescape html entities
    text = unescape(text)
    return text
Of course, this is not as error-proof as BeautifulSoup or other parser-based solutions, but you don't need any third-party packages.
Upvotes: 1
Reputation: 1160
The main problem is how you keep some basic formatting. Here is my own minimal approach to keep new lines and bullets. I am sure it's not the solution to everything you want to keep but it's a starting point:
from bs4 import BeautifulSoup
def parse_html(html):
    elem = BeautifulSoup(html, features="html.parser")
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            text += e.strip()
        elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th']:
            text += '\n'
        elif e.name == 'li':
            text += '\n- '
    return text
The above adds a new line for 'br', 'p', 'h1', 'h2', 'h3', 'h4', 'tr', 'th' and a new line with - in front of the text for li elements.
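For example, with a small hypothetical snippet containing a list:
# hypothetical input, just to illustrate the paragraph and bullet handling
print(parse_html("<p>Fruit</p><ul><li>apple</li><li>pear</li></ul>"))
# prints a blank line, then:
# Fruit
# - apple
# - pear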
Upvotes: 11
Reputation: 36000
You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:
import re
data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""
data = re.sub(r'<.*?>', '', data)
print(data)
Output
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Upvotes: 9
Reputation: 80406
soup.get_text() outputs what you want:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print(soup.get_text())
output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
To keep newlines:
print(soup.get_text('\n'))
To be identical to your example, you can replace a newline with two newlines:
soup.get_text().replace('\n','\n\n')
Upvotes: 150
Reputation: 10558
gazpacho might be a good choice for this!
Input:
from gazpacho import Soup
html = """\
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
"""
Output:
text = Soup(html).strip(whitespace=False) # to keep "\n" characters intact
print(text)
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Upvotes: 2
Reputation: 514
There are some nice things here, and I might as well throw in my solution:
from html.parser import HTMLParser


def _handle_data(self, data):
    self.text += data + '\n'


HTMLParser.handle_data = _handle_data


def get_html_text(html: str):
    parser = HTMLParser()
    parser.text = ''
    parser.feed(html)
    return parser.text.strip()
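A quick usage sketch with a hypothetical two-paragraph snippet:
# hypothetical input, just to show the newline handling
print(get_html_text("<p>Hello</p><p>World</p>"))
# Hello
# World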
Upvotes: 4
Reputation: 23582
I liked @FrBrGeorge's no-dependency answer so much that I expanded it to extract only the body tag and added a convenience method, so that HTML to text is a single line:
from abc import ABC
from html.parser import HTMLParser


class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.

    Usage:
        str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()
See the docstring for usage.
This converts all of the text inside the body, which in theory could include style and script tags. Further filtering could be achieved by extending the pattern shown for body, i.e. setting instance variables in_style or in_script, as sketched below.
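A minimal sketch of that extension, building on the HTMLFilter class above (the HTMLBodyFilter name is just for illustration):
class HTMLBodyFilter(HTMLFilter):
    """Like HTMLFilter, but also drops text inside <style> and <script> tags."""
    def __init__(self, *args, **kwargs):
        self.in_style = False
        self.in_script = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag, attrs):
        super().handle_starttag(tag, attrs)
        if tag.lower() == "style":
            self.in_style = True
        elif tag.lower() == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        super().handle_endtag(tag)
        if tag.lower() == "style":
            self.in_style = False
        elif tag.lower() == "script":
            self.in_script = False

    def handle_data(self, data):
        # only keep text that is in the body and not inside style/script
        if self.in_body and not (self.in_style or self.in_script):
            self.text += data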
Upvotes: 5
Reputation: 730
It's possible using the standard Python html.parser module:
from html.parser import HTMLParser


class HTMLFilter(HTMLParser):
    text = ""

    def handle_data(self, data):
        self.text += data


f = HTMLFilter()
f.feed(data)  # `data` is the HTML string to convert
print(f.text)
Upvotes: 44
Reputation: 4073
It's possible to use BeautifulSoup to remove unwanted scripts and similar, though you may need to experiment with a few different sites to make sure you've covered the different types of things you wish to exclude. Try this:
from requests import get
from bs4 import BeautifulSoup as BS

response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")

for child in soup.body.children:
    if child.name == 'script':
        child.decompose()

print(soup.body.get_text())
Upvotes: 1
Reputation: 19
I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.
import urllib

def html2text(strText):
    str1 = strText
    int2 = str1.lower().find("<body")
    if int2 > 0:
        str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2 > 0:
        str1 = str1[:int2]
    list1 = ['<br>', '<tr', '<td', '</p>', 'span>', 'li>', '</h', 'div>']
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13), chr(13), chr(13), chr(13)]
    bolFlag1 = True
    bolFlag2 = True
    strReturn = ""
    for int1 in range(len(str1)):
        str2 = str1[int1]
        for int2 in range(len(list1)):
            if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
                strReturn = strReturn + list2[int2]
        if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
            bolFlag1 = False
        if str1[int1:int1+6].lower() == '<style':
            bolFlag1 = False
        if str1[int1:int1+7].lower() == '</style':
            bolFlag1 = True
        if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
            bolFlag1 = True
        if str2 == '<':
            bolFlag2 = False
        if bolFlag1 and bolFlag2 and (ord(str2) != 10):
            strReturn = strReturn + str2
        if str2 == '>':
            bolFlag2 = True
        if bolFlag1 and bolFlag2:
            strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
            strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn

url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
html = urllib.urlopen(url).read()
print html2text(html)
Upvotes: 1
Reputation: 2713
Passing '\n' to get_text() places a newline between the paragraphs.
from bs4 import BeautifulSoup

soup = BeautifulSoup(text)  # `text` is the HTML string to convert
print(soup.get_text('\n'))
Upvotes: 6