Reputation: 4425
I'm working with HTML pages using BeautifulSoup4. The .html files contain HTTP header information at the top. How can I filter that out? Here is a snippet from one of the HTML files:
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-17T03:07:46Z
WARC-TREC-ID: clueweb12-0206wb-51-29582
WARC-Record-ID: <urn:uuid:546b127c-040e-4dee-a565-3a3f6683f898>
Content-Type: application/http; msgtype=response
Content-Length: 29032
HTTP/1.1 200 OK
Cache-Control: private
Connection: close
Date: Fri, 17 Feb 2012 03:07:48 GMT
Content-Length: 28332
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie: chkvalues=ClmZLoF4xnHoBwiZnWFzYcCMoYB/fMxYfeeJl/zhlypgwivOzw6qnVBRWzf8f19O; expires=Wed, 15-Aug-2012 02:07:48 GMT; path=/
Set-Cookie: previous-category-id=11; expires=Fri, 17-Feb-2012 03:27:48
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head id="ctl00_headTag"><title>
I want to extract only the text between <html> and </html>, nothing else. When I try something like this:
with codecs.open(file, 'r', 'utf-8', errors='ignore') as f:
    contents = f.read()
soup = BeautifulSoup(contents, "lxml")
for script in soup.find_all(["script", "style"]):  # remove script and style tags
    script.extract()
try:
    raw_text = soup.find('html').text.lower()
except AttributeError:
    pprint('{0} file is empty'.format(file))
raw_text ends up containing information like
"WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2012-02-17T03:07:46Z....
which means the headers are being added to raw_text. How can I remove the headers from my raw text?
Upvotes: 1
Views: 1116
Reputation: 15376
HTTP headers are separated from the body by a blank line (two consecutive line breaks), so you could use \r\n\r\n to split your data. However, your file contains two header blocks (the WARC record headers and the HTTP response headers), so it would be easier to use the beginning of the body as a separator.
try:
    contents = contents[contents.index('<!DOCTYPE'):]
except ValueError:
    contents = contents[contents.index('<html'):]
soup = BeautifulSoup(contents, "lxml")
Some HTML documents may not have a DOCTYPE declaration. In that case, the try/except block above falls back to '<html' as the index.
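The same idea can be sketched as a small helper that searches for either marker while ignoring case, since some pages write `<!doctype` or `<HTML`. This is only an illustration: the name `strip_headers`, the regular expression, and the sample string are assumptions made for this sketch, not part of the original file or of any library.

```python
import re

# Hypothetical pattern: the first place where the HTML document starts,
# matched case-insensitively.
_BODY_START = re.compile(r'<!DOCTYPE|<html', re.IGNORECASE)


def strip_headers(contents):
    """Return contents from the first '<!DOCTYPE' or '<html' onward.

    Returns an empty string when neither marker is present.
    """
    match = _BODY_START.search(contents)
    return contents[match.start():] if match else ''


# Made-up sample with a WARC header block and an HTTP header block.
sample = (
    'WARC/1.0\r\n'
    'WARC-Type: response\r\n'
    '\r\n'
    'HTTP/1.1 200 OK\r\n'
    'Content-Type: text/html; charset=utf-8\r\n'
    '\r\n'
    '<!doctype html>\n<html><body><p>Hello</p></body></html>'
)

body = strip_headers(sample)
print(body)
```

The result can then be passed to BeautifulSoup(body, "lxml") as before. Returning an empty string instead of raising makes it easy to skip files that contain no markup at all.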
Upvotes: 3
Reputation: 6748
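contents = '\n'.join([e for e in contents.split('\n') if (e and e[0] == "<")])
You could use this list comprehension on the file contents, before passing them to BeautifulSoup, to keep only the lines that begin with a <. Note that it also drops any body lines that do not begin with a tag.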
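As a rough sketch of this line-filtering idea, applied to the raw file contents before parsing: the helper name `keep_markup_lines` and the sample string are made up for illustration. This variant also tolerates indented tags by stripping leading whitespace first, which is an addition to the one-liner.

```python
def keep_markup_lines(contents):
    """Keep only the lines whose first non-blank character is '<'."""
    return '\n'.join(
        line for line in contents.splitlines()
        if line.lstrip().startswith('<')
    )


# Made-up sample: two header blocks followed by a tiny HTML document.
sample = (
    'WARC/1.0\r\n'
    'WARC-Type: response\r\n'
    '\r\n'
    'HTTP/1.1 200 OK\r\n'
    '\r\n'
    '<html>\n'
    '  <body>\n'
    '<p>Hello</p>\n'
    'plain text line\n'
    '</body>\n'
    '</html>'
)

filtered = keep_markup_lines(sample)
print(filtered)
```

In this sample the header lines are removed, but so is the line "plain text line", so this filter only suits pages where every line of interest starts with a tag.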
Upvotes: 0