Reputation: 170320
I got the entire HTTP response as a string but I want to extract just the body.
I would prefer not to use an external library or reimplement the header parsing.
Content-Type: text/xml
Content-Length: 129
<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>
Update: In case it wasn't obvious, I get the data from a source other than a URL, so anything that requires a URL is useless.
I do, however, read the data from a stream-like object (data = stream.read()), so a solution that works on a stream is also acceptable.
2nd update: yes, this is an XML-RPC response, but I'm receiving it over a different transport, so I cannot use httplib to parse it, mainly because httplib is broken and does not accept strings or streams for parsing.
3rd update: the double newline can be \r\n\r\n or \n\n, depending on the server.
So, to make it clear: the input is an HTTP response that is supposed to contain an XML-RPC response, and the output has to be that XML-RPC response. It doesn't have to parse the XML, but it has to extract the XML from the response properly.
Upvotes: 2
Views: 7803
Reputation: 9041
Short and sweet:
body = response.split('\r\n\r\n', 1)[-1]
(This uses the two-argument form of split(), which splits at most once; [-1] selects the last element of the resulting list.)
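To illustrate why the second argument matters, here is a minimal sketch with a made-up response whose body itself contains a blank line. The sample text and the variable names are assumptions for illustration only.

```python
# Hypothetical sample: the body itself contains a blank line.
response = (
    "Content-Type: text/plain\r\n"
    "Content-Length: 15\r\n"
    "\r\n"
    "first\r\n"
    "\r\n"
    "second"
)

# maxsplit=1 stops after the first separator, so the body stays intact.
body = response.split("\r\n\r\n", 1)[-1]

# Without maxsplit, [-1] keeps only the text after the last blank line.
truncated = response.split("\r\n\r\n")[-1]
```

With the limit of one split, body keeps both parts of the payload; without it, only the trailing part survives.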
Upvotes: 2
Reputation: 27565
resp = ('Content-Type: text/xml\r\n'
'Content-Length: 129\r\n'
"<?xml version='1.0'?>\r\n"
'\r\n'
'<methodResponse>\r\n'
'<params>\r\n'
'<param>\r\n'
'<value><boolean>0</boolean></value>\r\n'
'</param>\r\n'
'</params>\r\n'
'</methodResponse>\r\n'
'</code>')
print(resp.partition('\r\n\r\n')[2])
result
<methodResponse>
<params>
<param>
<value><boolean>0</boolean></value>
</param>
</params>
</methodResponse>
</code>
On my display, the characters '\r' appear as squares at the end of each line.
The advantage of partition() is that it always returns a tuple of 3 elements.
So if the sequence '\r\n\r\n' does not occur in the text,
resp.partition('\r\n\r\n')[2]
is "",
whereas split('\r\n\r\n')[1]
raises an IndexError and split('\r\n\r\n')[-1]
returns the entire text.
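The three behaviours can be checked side by side with a short sketch. The input string here is a hypothetical text that contains no header/body separator at all.

```python
text = "no blank line here"  # hypothetical input without a separator

# partition() always yields three pieces; the last two are empty
# when the separator is absent.
head, sep, body = text.partition("\r\n\r\n")

# split()[-1] silently hands back the whole text instead.
last_piece = text.split("\r\n\r\n")[-1]

# split()[1] fails because the list has only one element.
try:
    text.split("\r\n\r\n")[1]
    raised = False
except IndexError:
    raised = True
```

Which behaviour is preferable depends on whether a missing separator should be treated as "no body" or as "everything is body".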
If the double newline varies, a regex is a convenient way to cover the variants.
To write the pattern, you need to know which variants can occur.
Supposing that only "\n\n", "\r\n\n", "\n\r\n" and "\r\n\r\n" are possible, the body can be captured with a regex based on the following pattern:
import re
regx = re.compile(r'(?:(?:\r?\n){2}|\Z)(.+)?', re.DOTALL)
for ss in (('Content-Type: text/xml\r\n'
'Content-Length: 129\r\n'
"<?xml version='1.0'?>\n"
'\n'
'body1\r\n'
'<params>\r\n'
'<param>\r\n'
'</code>') ,
('Content-Type: text/xml\r\n'
'Content-Length: 129\r\n'
"<?xml version='1.0'?>\r\n"
'\n'
'body2\r\n'
'<params>\r\n'
'<param>\r\n'
'</code>') ,
('Content-Type: text/xml\r\n'
'Content-Length: 129\r\n'
"<?xml version='1.0'?>\n"
'\r\n'
'body3\r\n'
'<params>\r\n'
'<param>\r\n'
'</code>') ,
('Content-Type: text/xml\r\n'
'Content-Length: 129\r\n'
"<?xml version='1.0'?>\r\n"
'\r\n'
'body4\r\n'
'<params>\r\n'
'<param>\r\n'
'</code>') ,
('Content-Type: text/xml\r\n'
'Content-Length: 129\r\r'
"<?xml version='1.0'?>\r\r"
'\r\n'
'body4\r\n'
'<params>\r\n'
'<param>\r\n'
'</code>') ,):
    print('splitting on sequence : %r\n%r\n'
          % (re.search('(?:\r?\n)+(?=body)', ss).group(),
             regx.search(ss).group(1)))
result
splitting on sequence : '\n\n'
'body1\r\n<params>\r\n<param>\r\n</code>'
splitting on sequence : '\r\n\n'
'body2\r\n<params>\r\n<param>\r\n</code>'
splitting on sequence : '\n\r\n'
'body3\r\n<params>\r\n<param>\r\n</code>'
splitting on sequence : '\r\n\r\n'
'body4\r\n<params>\r\n<param>\r\n</code>'
splitting on sequence : '\r\n'
None
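The same idea can be wrapped in a small helper. This is only a sketch under the assumption stated above (line endings are LF or CRLF, in any mix); the name extract_body is hypothetical, and unlike the pattern above it returns an empty string rather than None when no blank line is found.

```python
import re

# A blank line formed by two line endings, each either LF or CRLF.
_SEPARATOR = re.compile(r"\r?\n\r?\n")


def extract_body(response):
    """Return the text after the first blank line, or "" if there is none."""
    parts = _SEPARATOR.split(response, maxsplit=1)
    return parts[1] if len(parts) == 2 else ""
```

Splitting at most once keeps any later blank lines inside the body untouched.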
Upvotes: 2
Reputation: 9456
Based on Michal's solution, but this one includes an essential fix:
def strip_http_headers(http_reply):
    p = http_reply.find('\r\n\r\n')
    if p >= 0:
        return http_reply[p+4:]
    return http_reply
Upvotes: 6
Reputation: 29913
Besides what Tito said, there's also the requests package
>>> import requests
>>> r = requests.get("http://yoururl")
>>> r
<Response [200]>
>>> r.content
...
And then parse it with minidom or whatever tool you choose for that.
Upvotes: 1
Reputation: 54292
In an HTTP response, the headers are separated from the body by two consecutive CRLF sequences ('\r\n\r\n'). So you can use the string's find() method like this:
def strip_http_headers(http_reply):
    p = http_reply.find('\r\n\r\n')
    if p >= 0:
        # skip the four-character separator itself
        return http_reply[p + 4:]
    return http_reply
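Since the question mentions that the data arrives through a stream-like object, the same rule (headers end at the first blank line) can also be applied line by line. The following is a rough sketch, not a definitive implementation: read_body is a hypothetical name, the stream is assumed to be a text stream with readline() and read(), and io.StringIO merely stands in for the real source. For a byte stream the string literals would need to be bytes.

```python
import io


def read_body(stream):
    """Skip header lines up to the first blank line, then return the rest."""
    while True:
        line = stream.readline()
        if line == "":
            return ""  # end of stream reached without a blank line
        # The header block ends at a line that is empty once CR/LF is stripped,
        # which covers both "\r\n" and "\n" line endings.
        if line.rstrip("\r\n") == "":
            return stream.read()


# Hypothetical usage with an in-memory stream standing in for the real one.
stream = io.StringIO("Content-Type: text/xml\r\nContent-Length: 4\r\n\r\n<x/>")
body = read_body(stream)
```

Because each line is checked after stripping its line ending, this variant does not need to know in advance whether the server sends \r\n\r\n or \n\n.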
Upvotes: 3