Patrick A
Patrick A

Reputation: 277

How to extract data from XML/SOAP in Python

The UK National Gas system publishes a mass of data that can be access from a SOAP server and an example of the returned data (for LNG) is shown below. I've written the code to generate the request and to handle the response but am tripping up on how to extract the returned information. The aim would be to upload the data into a backend database or into a Pandas dataframe.

In previous code, I've simply traversed the XML using XPATH and then iterated over the tag and extracted out the child data. Thus, I was hoping to extract:

GetPublicationDataWMResult, ApplicableAt, ApplicableFor, Value, ...
LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...
LNG Capacity,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 6515042480, ...

Having tried to use XPATH to traverse the children (/Envelope/Body/GetPublicationDataWMResponse/GetPublicationDataWMResult/) it is failing.

The logic works if I sanitize the code by adding a series of string removals but that's sub-optimal and bound to break in the future.

EXAMPLE CODE:

import requests
from lxml import objectify

def getXML():

    toDate = "2016-03-12"
    fromDate = "2016-03-12"
    dateType = "gasday"

    url="http://marketinformation.natgrid.co.uk/MIPIws-public/public/publicwebservice.asmx"
    headers = {'content-type': 'application/soap+xml; charset=utf-8'}

    body ="""<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
        <soap12:Body>
            <GetPublicationDataWM xmlns="http://www.NationalGrid.com/MIPI/">
                <reqObject>
                    <LatestFlag>Y</LatestFlag>
                    <ApplicableForFlag>Y</ApplicableForFlag>
                    <ToDate>%s</ToDate>
                    <FromDate>%s</FromDate>
                    <DateType>%s</DateType>
                    <PublicationObjectNameList>
                        <string>LNG Stock Level</string>
                        <string>LNG, Daily Aggregated Available Capacity, D+1</string>
                    </PublicationObjectNameList>
                </reqObject>
            </GetPublicationDataWM>
        </soap12:Body>
    </soap12:Envelope>""" % (toDate, fromDate,dateType)


    response = requests.post(url,data=body,headers=headers)

    return response.content

root = objectify.fromstring(getXML())

Returned XML:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <GetPublicationDataWMResponse
            xmlns="http://www.NationalGrid.com/MIPI/">
            <GetPublicationDataWMResult>
                <CLSMIPIPublicationObjectBE>
                    <PublicationObjectName>LNG Stock Level</PublicationObjectName>
                    <PublicationObjectData>
                        <CLSPublicationObjectDataBE>
                            <ApplicableAt>2016-03-13T15:00:07Z</ApplicableAt>
                            <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
                            <Value>7050.42286</Value>
                            <GeneratedTimeStamp>2016-03-13T15:56:00Z</GeneratedTimeStamp>
                            <QualityIndicator></QualityIndicator>
                            <Substituted>N</Substituted>
                            <CreatedDate>2016-03-13T15:56:28Z</CreatedDate>
                        </CLSPublicationObjectDataBE>
                    </PublicationObjectData>
                </CLSMIPIPublicationObjectBE>
                <CLSMIPIPublicationObjectBE>
                    <PublicationObjectName>LNG Capacity</PublicationObjectName>
                    <PublicationObjectData>
                        <CLSPublicationObjectDataBE>
                            <ApplicableAt>2016-03-12T15:30:00Z</ApplicableAt>
                            <ApplicableFor>2016-03-12T00:00:00Z</ApplicableFor>
                            <Value>6515042480</Value>
                            <GeneratedTimeStamp>2016-03-12T16:00:00Z</GeneratedTimeStamp>
                            <QualityIndicator></QualityIndicator>
                            <Substituted>N</Substituted>
                            <CreatedDate>2016-03-12T16:00:20Z</CreatedDate>
                        </CLSPublicationObjectDataBE>
                    </PublicationObjectData>
                </CLSMIPIPublicationObjectBE>
            </GetPublicationDataWMResult>
        </GetPublicationDataWMResponse>
    </soap:Body>
</soap:Envelope>

Upvotes: 3

Views: 4128

Answers (2)

har07
har07

Reputation: 89285

This is a FAQ in XML+XPath topic that involves XML with default namespace.

XML element where default namespace is declared and its descendant elements without prefix inherits the same default namespace implicitly. And in XPath expression, to reference element in namespace you need to use prefix that has been mapped to the corresponding namespace URI. Using lxml the codes will be about like the following :

root = etree.fromstring(getXML())

# map prefix 'd' to the default namespace URI
ns = { 'd': 'http://www.NationalGrid.com/MIPI/'}

publication_objects = root.xpath('//d:CLSMIPIPublicationObjectBE', namespaces=ns)
for obj in publication_objects:
    name = obj.find('d:PublicationObjectName', ns).text

    data = obj.find('d:PublicationObjectData/d:CLSPublicationObjectDataBE', ns)
    applicable_at = data.find('d:ApplicableAt', ns).text
    applicable_for = data.find('d:ApplicableFor', ns).text
    # todo: extract other relevant data and process as needed

Upvotes: 1

roadrunner66
roadrunner66

Reputation: 7941

Using your existing code I just added this:

res= getXML()

from bs4 import BeautifulSoup
soup = BeautifulSoup(res, 'html.parser')

searchTerms= ['PublicationObjectName','ApplicableAt','ApplicableFor','Value']
# LNG Stock Level,2016-03-13T15:00:07Z, 2016-03-12T00:00:00Z, 7050.42286, ...

for st in searchTerms:
    print st+'\t',
    print soup.find(st.lower()).contents[0]

Output:

PublicationObjectName   LNG Stock Level
ApplicableAt    2016-03-13T15:00:07Z
ApplicableFor   2016-03-12T00:00:00Z
Value   7050.42286

Upvotes: 1

Related Questions