ignore headers text from raw text python BeautifulSoup

Question

I'm working with HTML pages using BeautifulSoup4. The HTML files contain request header information at the top. How can I filter that out?

Here is a snippet of one such file:

<pre><code>WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-17T03:07:46Z
WARC-TREC-ID: clueweb12-0206wb-51-29582
WARC-Record-ID: 
Content-Type: application/http; msgtype=response
Content-Length: 29032

HTTP/1.1 200 OK
Cache-Control: private
Connection: close
Date: Fri, 17 Feb 2012 03:07:48 GMT
Content-Length: 28332
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie:         chkvalues=ClmZLoF4xnHoBwiZnWFzYcCMoYB/fMxYfeeJl/zhlypgwivOzw6qnVBRWzf8f19O; expires=Wed, 15-Aug-2012 02:07:48 GMT; path=/
Set-Cookie: previous-category-id=11; expires=Fri, 17-Feb-2012 03:27:48




</code></pre>

<p>I want to extract only the text between the <code>&lt;html&gt;</code> and <code>&lt;/html&gt;</code> tags, nothing else. When I try something like this:</p>

<pre><code>import codecs
from pprint import pprint

from bs4 import BeautifulSoup

# `file` is the path of the page being processed
with codecs.open(file, 'r', 'utf-8', errors='ignore') as f:
    contents = f.read()
soup = BeautifulSoup(contents, "lxml")
for script in soup.find_all(["script", "style"]):  # remove script and style tags
    script.extract()
try:
    raw_text = soup.find('html').text.lower()
except AttributeError:
    pprint('{0} file is empty'.format(file))
</code></pre>

<p><code>raw_text</code> ends up containing text such as
<code>"WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-17T03:07:46Z....</code>, which means the headers are being included in <code>raw_text</code>.</p>

<p>How can I remove the headers from my raw text?</p>

t.m.adam · Accepted Answer

HTTP headers are separated from the body by a blank line (two consecutive newlines), so you could split your data on that. However, your file contains two header blocks (the WARC record headers and the HTTP response headers), so it would be easier to use the beginning of the body as a separator.

try:
    contents = contents[contents.index('<!DOCTYPE'):]
except ValueError:
    pass


Some HTML documents may not have a DOCTYPE declaration. In that case, use '<html' as the string to search for, and wrap the lookup in a try/except block as well.
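
The two cases above (DOCTYPE present, or only an opening html tag) can be combined into one helper. This is a minimal sketch, not part of the original answer: the name `strip_headers` is hypothetical, and it assumes the markers may appear in any letter case and that a record without either marker should yield an empty string.

```python
import re

# Matches the start of the HTML document: a DOCTYPE declaration or an
# opening <html> tag, in any letter case (an assumption about the data).
_HTML_START = re.compile(r'<!doctype\b|<html\b', re.IGNORECASE)


def strip_headers(contents):
    """Return `contents` from the first DOCTYPE or <html> tag onward.

    Hypothetical helper: returns an empty string when neither marker
    is found, so the caller can treat the record as having no HTML.
    """
    match = _HTML_START.search(contents)
    if match is None:
        return ''
    return contents[match.start():]
```

The result can then be passed to `BeautifulSoup` in place of the full `contents`, so the WARC and HTTP header lines never reach the parser.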
