sorin
sorin

Reputation: 170320

How do I get the body of a http response from a string containing the entire response, in Python?

I got the entire HTTP response as a string but I want to extract just the body.

I would prefer not to use an external library or reimplement the header parsing.

Content-Type: text/xml
Content-Length: 129

<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>

Update: If it wasn't obvious, I do get the data from other source than an URL so any attempt to use something that requires and URL is useless.

Still I do read the data from a stream like object data = stream.read(), so a solution that can use a stream is also acceptable.

2nd update, yes this is a XMLRPC response but it's one that I'm getting using a different transport so I cannot use httplib to parse it, mainly because httplib is broken and not accepting strings or streams for parsing.

3rd update, the double newline can be \r\n\r\n or \n\n based on the server.

So to make it clear: the input is a HTTP response that is supposed to contain an XMLRPC response and the output has to be the response. It doesn't have to parse the XML, but it has to be able to properly extract the XML from the response.

Upvotes: 2

Views: 7803

Answers (6)

yak
yak

Reputation: 9041

Short and sweet:

body = response.split('\r\n\r\n', 1)[-1]

(it uses two argument version of split() and [-1] means last element of array)

Upvotes: 2

eyquem
eyquem

Reputation: 27565

resp = ('Content-Type: text/xml\r\n'
        'Content-Length: 129\r\n'
        "<?xml version='1.0'?>\r\n"
        '\r\n'
        '<methodResponse>\r\n'
        '<params>\r\n'
        '<param>\r\n'
        '<value><boolean>0</boolean></value>\r\n'
        '</param>\r\n'
        '</params>\r\n'
        '</methodResponse>\r\n'
        '</code>')

print resp.partition('\r\n\r\n')[2]

result

<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>

On my display, the characters '\r' appear as squares at the end of each line.

The advantage of partition() is that it returns ALWAYS a tuple of 3 elements:
then, if there is not the sequence '\r\n\r\n' in the text,
resp.partition('\r\n\r\n')[2] will be ""
while split('\r\n\r\n')[1] causes an error and split('\r\n\r\n')[-1] is the entire text.

EDIT

If the double newline is variable, only a regex can hold the variability.
It is necessary to know what is the span of variability to craft a regex pattern.

Supposing that only "\n\n", "\r\n\n", "\n\r\n" and "\r\n\r\n" are possible , a solution would be to catch the body with help of the regex based on following pattern :

import re

regx = re.compile('(?:(?:\r?\n){2}|\Z)(.+)?',re.DOTALL)

for ss in (('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\n"
            '\n'
            'body1\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\r\n"
            '\n'
            'body2\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\n"
            '\r\n'
            'body3\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\n'
            "<?xml version='1.0'?>\r\n"
            '\r\n'
            'body4\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,
           ('Content-Type: text/xml\r\n'
            'Content-Length: 129\r\r'
            "<?xml version='1.0'?>\r\r"
            '\r\n'
            'body4\r\n'
            '<params>\r\n'
            '<param>\r\n'
            '</code>') ,):
    print ('splitting on sequence  :  %r\n%r\n') \
           % (re.search('(?:\r?\n)+(?=body)',ss).group(),
              regx.search(ss).group(1))

result

splitting on sequence  :  '\n\n'
'body1\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n\n'
'body2\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\n\r\n'
'body3\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n\r\n'
'body4\r\n<params>\r\n<param>\r\n</code>'

splitting on sequence  :  '\r\n'
None

Upvotes: 2

bogdan
bogdan

Reputation: 9456

Based on Michal solution but this one includes and essential fix:

def strip_http_headers(http_reply):
    p = http_reply.find('\r\n\r\n')
    if p >= 0:
        return http_reply[p+4:]
    return http_reply

Upvotes: 6

Johannes Charra
Johannes Charra

Reputation: 29913

Besides what Tito said, there's also the requests package

>>> import requests
>>> r = requests.get("http://yoururl")
>>> r
<Response [200]>
>>> r.content
...

And then parse it with minidom or whatever tool you choose for that.

Upvotes: 1

Michał Niklas
Michał Niklas

Reputation: 54292

In HTTP response headers are separated from body by two CRLF characters. So you can use string.find() method like this:

p = http_reply.find('\r\n\r\n')
if p >= 0:
    return http_reply[p:]
return http_reply

Upvotes: 3

tito
tito

Reputation: 13251

You can use standard urllib2:

from urllib2 import urlopen
data = urlopen('http://url.here/').read()

And if you want to parse xml:

from urllib2 import urlopen
from xml.dom.minidom import parse

xml = parse(urlopen('http://url.here'))

Upvotes: 1

Related Questions