Cold-Blooded

Reputation: 420

HTML to text conversion using Python

import urllib2

from BeautifulSoup import *

resp = urllib2.urlopen("file:///D:/sample.html")

rawhtml = resp.read()

resp.close()
print rawhtml

I am using this code to get the text from an HTML document, but it also gives me the HTML markup. What should I do to fetch only the text from the document?

Upvotes: 3

Views: 3147

Answers (5)

Mikael Lepistö

Reputation: 19728

I've been using the html2text package together with Beautiful Soup, which works around some of html2text's problems. For example, html2text did not understand the auml or ouml entities, only Auml and Ouml with an uppercase first letter.

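import html2text
from BeautifulSoup import BeautifulStoneSoup
# Let Beautiful Soup convert the HTML entities to Unicode first, then hand the result to html2text.
unicode_coded_entities_html = unicode(BeautifulStoneSoup(html, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
text = html2text.html2text(unicode_coded_entities_html)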

html2text converts HTML to Markdown syntax, so the converted text can also be rendered back to HTML (some information is, of course, lost in the transformation).

Upvotes: 0

Miki Tebeka

Reputation: 13910

There's also html2text.

Another option is to pipe the HTML to "lynx -dump".

Upvotes: 1

Bryce Thomas

Reputation: 10799

Adapted from Toby Segaran's Programming Collective Intelligence (page 60):

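def gettextonly(soup):
    # soup.string is None when the node has more than one child
    v = soup.string
    if v is None:
        # Recurse into the children and join their text with newlines
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()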

Example usage:

>>> from BeautifulSoup import BeautifulSoup

>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> ''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is paragraph <b>two</b>.</html>'

>>> soup = BeautifulSoup(''.join(doc))
>>> gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'

Note that the result is a single string, with text from inside different tags separated by newline (\n) characters.

If you would like to extract the words of the text as a list, you can use the following function, also adapted from Toby Segaran's Programming Collective Intelligence (page 61):

import re
def separatewords(text):
    # Split on runs of one or more non-word characters
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']

Example usage:

>>> separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is', 
u'paragraph', u'two']
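
A note for newer interpreters: since Python 3.7, re.split also splits on empty matches, so a pattern that can match the empty string (such as \W*) breaks the text into single characters. Below is a minimal sketch of the same tokenizer for Python 3 that splits on \W+ instead. The name separate_words and the sample string are illustrative and not from the book.

```python
import re

# One or more non-word characters; this pattern never matches the empty string.
WORD_SPLITTER = re.compile(r"\W+")


def separate_words(text):
    """Return the lower-cased, non-empty words of the text."""
    return [word.lower() for word in WORD_SPLITTER.split(text) if word]


print(separate_words("Page title\n\nThis is paragraph\none\n."))
```

The `if word` filter drops the empty strings that re.split produces when the text starts or ends with a separator.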

Upvotes: 1

pyfunc

Reputation: 66739

The module's own documentation shows a way to extract all strings from a document: http://www.crummy.com/software/BeautifulSoup/

from BeautifulSoup import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.google.com")
rawhtml = resp.read()
soup = BeautifulSoup(rawhtml)

# Keep only the text nodes (unicode strings), dropping the tag objects
all_strings = [e for e in soup.recursiveChildGenerator()
               if isinstance(e, unicode)]
print all_strings
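
For readers on Python 3, where urllib2 and the BeautifulSoup 3 import are not available, the same idea (walk the document and keep only the text nodes) can be sketched with the standard library's html.parser. This swaps in a different parser and is not Beautiful Soup. The names TextCollector and html_to_text are made up for illustration, and skipping the contents of script and style elements is an assumption about what counts as "text".

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document (hypothetical helper)."""

    # Assumption: the contents of these elements are not wanted as text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the document's text nodes, one per line."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>Page title</title><style>p {color: red}</style></head>"
    "<body><p>This is paragraph <b>one</b>.</p></body></html>"
)
print(html_to_text(sample))
```

As with the list comprehension, text from different tags ends up in separate pieces. Joining with a space instead of a newline may read better for inline markup such as <b>.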

Upvotes: 3

gimel

Reputation: 86492

Note that your example makes no use of BeautifulSoup: it only reads the raw HTML and prints it. See the documentation and follow its examples.

The following example, taken from the documentation, searches the soup for <td> elements.

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

Upvotes: 4
