kjo

Reputation: 35311

html-to-text conversion using Python standard library only

I'm looking for the best way to convert HTML to text, using only modules from the Python 2.7.x standard library. (I.e., no BeautifulSoup, etc.)

By HTML-to-text conversion I mean the moral equivalent of lynx -dump. In fact, just getting rid of HTML tags intelligently and converting all HTML entities to ASCII (or to UTF-8-encoded Unicode) would suffice.

No regex-based answers, please. (Regexes are not up to the task.)

Thanks!

Upvotes: 1

Views: 1722

Answers (3)

Mohamed Technology

Reputation: 53

I wrote a very simple Python script that extracts only headings and paragraphs from HTML files without using any third-party libraries. Note: this script can only handle really simple HTML, and it's written in Python 3.

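#!/usr/bin/env python3
# Print every line that contains an <h1> or <p> opening tag.
# Only suitable for very simple HTML where each element sits on one line.
headings = "<h1>"
paragraphs = "<p>"

# The with-block keeps the file open while it is read and closes it afterwards
with open('filename.html') as f:
    for line in f:
        if headings in line or paragraphs in line:
            # Print the matching line itself, without its trailing newline
            print(line.rstrip())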

You can still expand on this idea and make it extract more stuff from the HTML file.

Upvotes: 0

kiran

Reputation: 27

I would also suggest taking a look at html2text.
Also take a look at another thread

Upvotes: -1

vartec

Reputation: 134601

Python has included the HTMLParser module since version 2.2. It's neither the most efficient nor the easiest to use, but it's there...
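
To illustrate the HTMLParser route, here is a minimal sketch of a tag stripper built on it. It is written for Python 3, where the class lives in html.parser; in Python 2.7 the same class is imported from the HTMLParser module, has no convert_charrefs argument, and reports entities through handle_entityref and handle_charref instead, so the code would need adapting. The names TextExtractor and html_to_text, and the choice of which tags count as block-level, are assumptions made for this example.

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Assumed set of tags that should start a new line in the output
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data (Python 3.4+).
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines
        lines = (" ".join(line.split()) for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text(page_source) returns the visible text with one line per block element. This is far from lynx -dump (no link lists, no table layout), but it handles entities and sloppy markup without regexes.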

And if you're dealing with proper XHTML (or you can pass it through Tidy), you can use the much better ElementTree:

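import xml.etree.ElementTree as ET

# Parse the file into a tree; this raises ParseError if it is not well-formed
tree = ET.parse("your_document.xhtml")
# tostring() is a module-level function; method="text" drops all markup
your_string = ET.tostring(tree.getroot(), encoding="utf-8", method="text")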
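
For markup that is already in memory, the same ElementTree idea can be sketched with fromstring and itertext, shown here in Python 3. The SAMPLE document and the xhtml_to_text name are made up for illustration. Note that the XML parser only knows the five predefined XML entities, so HTML-only entities such as &nbsp; will raise a ParseError unless the input has been cleaned up first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; any well-formed XHTML would do.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Title</h1><p>Fish &amp; <b>chips</b></p>"
    "</body></html>"
)


def xhtml_to_text(source):
    root = ET.fromstring(source)
    # itertext() yields every text and tail string in document order;
    # joining and re-splitting collapses whitespace to single spaces.
    return " ".join(" ".join(root.itertext()).split())
```

Here xhtml_to_text(SAMPLE) gives the text of the heading and paragraph on one line, with &amp; decoded to a plain ampersand.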

Upvotes: 5
