Reputation: 35311
I'm looking for the best way to convert HTML to text, using only modules from the Python 2.7.x standard library. (I.e., no BeautifulSoup, etc.)
By HTML-to-text conversion I mean the moral equivalent of lynx -dump. In fact, just getting rid of HTML tags intelligently, and converting all HTML entities to ASCII (or to UTF-8-encoded Unicode), would suffice.
No regex-based answers, please. (Regexes are not up to the task.)
Thanks!
Upvotes: 1
Views: 1722
Reputation: 53
I wrote a really simple Python script that extracts only headings and paragraphs from HTML files, without using any third-party libraries. Note: this script can only handle really simple HTML, and it's written in Python 3.
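#!/usr/bin/env python3
# Print every line that contains an <h1> or <p> opening tag.
# Only works for very simple HTML where each element sits on its own line.
headings = "<h1>"
paragraphs = "<p>"
# The with-block keeps the file open while we iterate and closes it afterwards.
with open('filename.html') as f:
    for line in f:
        if headings in line or paragraphs in line:
            print(line.rstrip())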
You can still expand on this idea and make it extract more stuff from the HTML file.
Upvotes: 0
Reputation: 27
I would also suggest that you should take a look at html2text.
Also take a look at another thread
Upvotes: -1
Reputation: 134601
Python has had the HTMLParser module since 2.2. It's neither the most efficient nor the easiest to use, but it's there...
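For illustration, here is a minimal sketch of how an HTMLParser subclass could strip tags and decode entities. The names TextExtractor and html_to_text are made up for this example, the list of block-level tags is an arbitrary choice, and the input is assumed to be already-decoded text (unicode on 2.7). It is nowhere near lynx -dump, just a starting point.

```python
try:  # Python 2.7
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint
    to_char = unichr
except ImportError:  # Python 3
    from html.parser import HTMLParser
    from html.entities import name2codepoint
    to_char = chr


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style> bodies."""

    SKIP = ("script", "style")
    BLOCK = ("p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6")

    def __init__(self):
        # HTMLParser is an old-style class on 2.7, so super() is not usable.
        HTMLParser.__init__(self)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")  # end each block element with a newline

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # On Python 3 the parser decodes references itself by default, so the
    # two handlers below are only called on Python 2.7.
    def handle_entityref(self, name):
        if self.skip_depth:
            return
        code = name2codepoint.get(name)
        # Unknown entities are passed through unchanged.
        self.parts.append(to_char(code) if code is not None else "&%s;" % name)

    def handle_charref(self, name):
        if self.skip_depth:
            return
        if name.lower().startswith("x"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.parts.append(to_char(code))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Calling html_to_text(u"<p>Fish &amp; chips</p>") should give the paragraph text with the entity decoded and the tags gone. Whitespace handling and link rendering are left out on purpose.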
And if you're dealing with proper XHTML (or you can pass it through Tidy), you can use the much better ElementTree:
import xml.etree.ElementTree as ET
# parse() returns an ElementTree; tostring() is a module-level function that takes an element.
tree = ET.parse("your_document.xhtml")
your_string = ET.tostring(tree.getroot(), encoding="utf-8", method="text")
Upvotes: 5