Reputation: 420
import urllib2
from BeautifulSoup import *
resp = urllib2.urlopen("file:///D:/sample.html")
rawhtml = resp.read()
resp.close()
print rawhtml
I am using this code to get the text from an HTML document, but it also gives me the HTML markup. What should I do to fetch only the text from the HTML document?
Upvotes: 3
Views: 3147
Reputation: 19728
I've been using the html2text package together with Beautiful Soup to work around some of the package's problems. For example, html2text did not understand the auml or ouml entities, only Auml and Ouml with an uppercase first letter.
import html2text
from BeautifulSoup import BeautifulStoneSoup
unicode_coded_entities_html = unicode(BeautifulStoneSoup(html, convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
text = html2text.html2text(unicode_coded_entities_html)
html2text converts to Markdown syntax, so the converted text can be rendered back to HTML as well (of course, some information is lost in the transformation).
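For readers on Python 3 (the BeautifulSoup 3 module and the unicode built-in used above are Python 2 only), the entity-decoding step on its own can be sketched with the standard library. This is only an illustration of the decoding, not the html2text workflow itself; the function name decode_entities and the sample string are made up for the example.

```python
import html

def decode_entities(markup):
    # html.unescape replaces named and numeric character references
    # (e.g. &auml;, &Ouml;, &#228;) with the characters they stand for.
    return html.unescape(markup)

# Hypothetical sample input mixing lowercase and uppercase entity names.
print(decode_entities("M&auml;nty &amp; &Ouml;ljy"))  # prints: Mänty & Öljy
```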
Upvotes: 0
Reputation: 13910
There's also html2text.
Another option is to pipe the HTML through "lynx -dump".
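A rough sketch of the piping idea from Python 3, assuming lynx is installed and on the PATH. The function name dump_text and its command parameter are made up for this example; the command is a parameter so that another text-mode renderer can be swapped in.

```python
import shutil
import subprocess

def dump_text(html, command=("lynx", "-dump", "-stdin")):
    """Feed HTML to an external program on stdin and return what it prints."""
    result = subprocess.run(
        list(command),
        input=html.encode("utf-8"),   # the child process reads the markup from stdin
        stdout=subprocess.PIPE,
        check=True,                   # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")

# Only try the default command when lynx is actually available.
if shutil.which("lynx"):
    print(dump_text("<html><body><p>Hello, <b>world</b>.</p></body></html>"))
```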
Upvotes: 1
Reputation: 10799
Adapted from Toby Segaran's Programming Collective Intelligence (page 60):
def gettextonly(soup):
    v = soup.string
    if v is None:
        # No single string here: recurse into the children and join their text.
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()
Example usage:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> ''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is paragraph <b>two</b>.</html>'
>>> soup = BeautifulSoup(''.join(doc))
>>> gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'
Note that the result is a single string, with text from inside different tags separated by newline (\n) characters.
If you would like to extract all of the words of the text as a list, you can use the following function, also adapted from Toby Segaran's Programming Collective Intelligence (page 61):
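import re
def separatewords(text):
    # Split on runs of non-word characters; '\W+' (rather than '\W*') never matches the empty string.
    splitter = re.compile('\\W+')
    return [s.lower() for s in splitter.split(text) if s != '']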
Example usage:
>>> separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is',
 u'paragraph', u'two']
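A note for anyone running this on Python 3.7 or later: re.split there also splits on empty matches, so a pattern that can match the empty string, such as '\W*', would break the text into single characters. Below is a small sketch of the same idea using '\W+'; the name separate_words and the sample text are only for illustration.

```python
import re

def separate_words(text):
    # '\W+' matches one or more non-word characters, so it cannot match
    # the empty string and words stay intact on every Python version.
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']

print(separate_words('Page title\n\nThis is paragraph\none\n.'))
# prints: ['page', 'title', 'this', 'is', 'paragraph', 'one']
```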
Upvotes: 1
Reputation: 66739
The module's own documentation shows a way to extract all strings from a document: http://www.crummy.com/software/BeautifulSoup/
from BeautifulSoup import BeautifulSoup
import urllib2
resp = urllib2.urlopen("http://www.google.com")
rawhtml = resp.read()
soup = BeautifulSoup(rawhtml)
all_strings = [e for e in soup.recursiveChildGenerator()
               if isinstance(e, unicode)]
print all_strings
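If BeautifulSoup is not available, roughly the same "collect every string" idea can be sketched with only the Python 3 standard library. This is an assumption-laden stand-in rather than the approach from the BeautifulSoup documentation: the names TextExtractor and extract_strings are invented here, and skipping the contents of script and style elements is a design choice of this sketch.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def extract_strings(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.parts

sample = ('<html><head><title>Page title</title><style>p {color: red}</style></head>'
          '<body><p>This is paragraph <b>one</b>.</p></body></html>')
print(extract_strings(sample))
# prints: ['Page title', 'This is paragraph', 'one', '.']
```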
Upvotes: 3
Reputation: 86492
Note that your example makes no use of BeautifulSoup. See the doc, and follow its examples.
The following example, taken from the link above, searches the soup for <td> elements.
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
Upvotes: 4