Mubin

Reputation: 4425

Ignore header text in raw text with Python BeautifulSoup

I'm working with HTML pages using BeautifulSoup4. The HTML files contain WARC and HTTP response header information at the top. How can I filter that out?

Here is a snippet of one HTML file:

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-17T03:07:46Z
WARC-TREC-ID: clueweb12-0206wb-51-29582
WARC-Record-ID: <urn:uuid:546b127c-040e-4dee-a565-3a3f6683f898>
Content-Type: application/http; msgtype=response
Content-Length: 29032

HTTP/1.1 200 OK
Cache-Control: private
Connection: close
Date: Fri, 17 Feb 2012 03:07:48 GMT
Content-Length: 28332
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie:         chkvalues=ClmZLoF4xnHoBwiZnWFzYcCMoYB/fMxYfeeJl/zhlypgwivOzw6qnVBRWzf8f19O; expires=Wed, 15-Aug-2012 02:07:48 GMT; path=/
Set-Cookie: previous-category-id=11; expires=Fri, 17-Feb-2012 03:27:48
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head id="ctl00_headTag"><title>

I want to extract only the text between <html> and </html>, nothing else. This is what I tried:

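import codecs
from pprint import pprint

from bs4 import BeautifulSoup

# `file` is the path of the page being processed
with codecs.open(file, 'r', 'utf-8', errors='ignore') as f:
    contents = f.read()
soup = BeautifulSoup(contents, "lxml")
for script in soup.find_all(["script", "style"]):  # remove script and style tags
    script.extract()
try:
    raw_text = soup.find('html').text.lower()
except AttributeError:
    # soup.find('html') returned None, so there is no <html> element
    pprint('{0} file is empty'.format(file))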

raw_text ends up containing header information such as "WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2012-02-17T03:07:46Z....", which means the headers are being included in raw_text.

How can I remove the headers from my raw text?

Upvotes: 1

Views: 1116

Answers (2)

t.m.adam

Reputation: 15376

HTTP headers are separated from the body by a blank line, so you could split your data on \r\n\r\n. However, your file contains two header blocks (the WARC record headers followed by the HTTP response headers), so it is easier to use the beginning of the body as the separator.

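try:
    # Drop everything before the DOCTYPE declaration
    contents = contents[contents.index('<!DOCTYPE'):]
except ValueError:
    # No DOCTYPE found; start at the opening <html> tag instead
    contents = contents[contents.index('<html'):]
soup = BeautifulSoup(contents, "lxml")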

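Some HTML documents have no DOCTYPE declaration, which is why the lookup is wrapped in a try/except block that falls back to '<html' as the starting index. Note that str.index is case-sensitive, so a lowercase '<!doctype' also takes the fallback.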
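For comparison, the blank-line split mentioned at the start of this answer could look roughly like the sketch below. It assumes each header block (WARC, then HTTP) ends with a blank line, as those formats specify; the function name strip_headers and the sample record are made up for illustration and are not from the question.

```python
import re


def strip_headers(record: str) -> str:
    """Return the body of a record that starts with WARC and HTTP headers.

    Assumption: each of the two header blocks ends with a blank line.
    """
    # Split on the first two blank lines only (CRLF or LF line endings),
    # so blank lines inside the HTML body are left alone.
    parts = re.split(r'\r?\n\r?\n', record, maxsplit=2)
    # Expected parts: [WARC headers, HTTP headers, body].
    # If fewer than two blank lines exist, return the input unchanged.
    return parts[2] if len(parts) == 3 else record


# Hypothetical sample record, shortened from the question's snippet
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>\r\n\r\nhello</body></html>"
)
body = strip_headers(sample)
```

Searching for the start of the body, as in the code above, is more forgiving when a file does not follow this layout exactly.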

Upvotes: 3

whackamadoodle3000

Reputation: 6748

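contents = '\n'.join([e for e in contents.split('\n') if (e and e[0] == "<")])

You could apply this list comprehension to the file contents before parsing, so that only lines beginning with < are kept. Note that it also drops any line of the body that does not start with <, such as indented tags or plain text.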
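As a rough illustration of what a line filter of this kind keeps and drops, here is a small variation. It is only a sketch: the helper name keep_markup_lines and the sample text are invented for the example, and unlike the one-liner above it ignores leading whitespace so that indented tags survive.

```python
def keep_markup_lines(text: str) -> str:
    """Keep only lines whose first non-blank character is '<'."""
    # splitlines() copes with both \r\n and \n line endings
    return '\n'.join(
        line for line in text.splitlines()
        if line.lstrip().startswith('<')
    )


# Hypothetical sample input
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "\r\n"
    "<html>\r\n"
    "  <p>kept</p>\r\n"
    "plain text line is dropped\r\n"
    "</html>\r\n"
)
filtered = keep_markup_lines(sample)
```

Here the header lines are removed, but so is the plain text line inside the body, which is the main limitation of filtering line by line.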

Upvotes: 0
