Robert Jonczy

Reputation: 123

How to extract XML from a log file to parse in Python

I have a log file containing XML envelopes of two kinds: requests and responses. I need to read this file, extract the XML documents, and put them into two arrays as strings (one array for requests, one for responses) so I can parse them later.

Any ideas how I can achieve this in Python?

Snippet of log file to be parsed (log contains ):

2014-10-31 12:27:33,600 INFO  Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Sending BILL request
2014-10-31 12:27:33,601 INFO  Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>
<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <transactionheader>
            <username>XXX</username>
            <password>XXX</password>
            <time>31/10/2014 12:27:33</time>
            <clientreferencenumber>123</clientreferencenumber>
            <numberrequests>3</numberrequests>
            <information>Description</information>
            <postbackurl>http://localhost/status</postbackurl>
    </transactionheader>
    <transactiondetails>
            <items>
                    <item id="1" client="XXX1" keyword="test"/>
                    <item id="2" client="XXX2" keyword="test"/>
                    <item id="3" client="XXX3" keyword="test"/>
            </items>
    </transactiondetails>
</request>
2014-10-31 12:27:34,487 INFO  Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] Response code 200 for bill request 
2014-10-31 12:27:34,489 INFO  Recharger_MTelemedia2Channel [mbpa.module.mgw.mtelemedia.mtbilling.MTSender][] <?xml version="1.0" encoding="UTF-8"?>

<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <serverreferencenumber>XXX123XXX</serverreferencenumber>
    <clientreferencenumber>123</clientreferencenumber>
    <information>Queued for Processing</information>
    <status>OK</status>
</response>

Many thanks for any replies!

Regards, Robert

Upvotes: 1

Views: 4615

Answers (1)

Anzel

Reputation: 20563

As both @Paco and @Lord_Gestalter suggested, you can strip the non-XML parts from your file and parse what remains with xml.etree, something like this:

# re is used to strip the non-XML text (timestamps, log prefixes)
import re
# ElementTree then parses what is left
import xml.etree.ElementTree as ET

# read the whole log file into the string 's'
with open('yourfilehere', 'r') as f:
    s = f.read()
# keep only the '<...>' part of each line (without re.DOTALL, '.' stops at a newline)
# and drop the <?xml ...?> declarations, since the log contains several XML documents
s = re.sub(r'<\?xml.*?\?>', '', ''.join(re.findall(r'<.*>', s)))
# wrap everything in a single root element so the result is well-formed
s = '<root>' + s + '</root>'
# parse s with ElementTree
tree = ET.fromstring(s)

tree
<Element 'root' at 0x7f2ab877e190>
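
Since the question asks for two arrays of strings, here is a minimal sketch of how the parsed root could be split by tag name. It assumes Python 3 and that every envelope's top-level tag is either request or response; split_envelopes and the shortened sample log are hypothetical names made up for illustration. Note that ElementTree re-serialises each element, so namespace prefixes in the returned strings may differ from what the log contains.

```python
import re
import xml.etree.ElementTree as ET


def split_envelopes(log_text):
    """Return (requests, responses) as lists of XML strings."""
    # Keep the '<...>' part of each line and drop the XML declarations.
    fragments = ''.join(re.findall(r'<.*>', log_text))
    fragments = re.sub(r'<\?xml.*?\?>', '', fragments)
    root = ET.fromstring('<root>' + fragments + '</root>')

    requests, responses = [], []
    for child in root:
        # Tags look like '{namespace}request'; keep the part after '}'.
        local_name = child.tag.rsplit('}', 1)[-1]
        xml_text = ET.tostring(child, encoding='unicode')
        if local_name == 'request':
            requests.append(xml_text)
        elif local_name == 'response':
            responses.append(xml_text)
    return requests, responses


# Shortened, made-up sample in the same shape as the log in the question.
sample = (
    '2014-10-31 12:27:33,601 INFO [x][] <?xml version="1.0" encoding="UTF-8"?>\n'
    '<request xmlns="XXX">\n'
    '    <information>Description</information>\n'
    '</request>\n'
    '2014-10-31 12:27:34,489 INFO [x][] <?xml version="1.0" encoding="UTF-8"?>\n'
    '\n'
    '<response xmlns="XXX">\n'
    '    <status>OK</status>\n'
    '</response>\n'
)
requests, responses = split_envelopes(sample)
print(len(requests), len(responses))
```

Each string in the two lists can be passed to ET.fromstring again later, which matches the "parse them later" requirement.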

If you don't need an XML parser and just want the 'request' and 'response' strings, use re.search:

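with open('yourfilehere', 'r') as f:
    s = f.read()
# '.' does not match a newline by default, so join the '<...>' part of every line
# into one string first; otherwise the searches below find nothing in a multi-line log
s = ''.join(re.findall(r'<.*>', s))
# put the request and response strings into 'req' and 'res'
# (with several requests or responses you need a more specific pattern, because '.*' is greedy)
req = [re.search(r'<request.*/request>', s).group()]
res = [re.search(r'<response.*/response>', s).group()]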

req
['<request xmlns="XXX" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><transactionheader><username>XXX</username><password>XXX</password><time>31/10/2014 12:27:33</time><clientreferencenumber>123</clientreferencenumber><numberrequests>3</numberrequests><information>Description</information><postbackurl>http://localhost/status</postbackurl></transactionheader><transactiondetails><items><item id="1" client="XXX1" keyword="test"/><item id="2" client="XXX2" keyword="test"/><item id="3" client="XXX3" keyword="test"/></items></transactiondetails></request>']

res
['<response xmlns="XXX" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><serverreferencenumber>XXX123XXX</serverreferencenumber><clientreferencenumber>123</clientreferencenumber><information>Queued for Processing</information><status>OK</status></response>']
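
For logs with several requests and responses, one possible refinement is re.findall with a non-greedy pattern and re.DOTALL, applied directly to the raw file contents. The sketch below assumes envelopes are never nested and always end with </request> or </response>; extract_envelopes and the sample text are hypothetical and only for illustration. Unlike the joined-line approach, it keeps the original line breaks and indentation of each envelope.

```python
import re

# '.*?' is non-greedy, so each match stops at the first closing tag;
# re.DOTALL lets '.' cross line breaks inside an envelope.
REQUEST_RE = re.compile(r'<request\b.*?</request>', re.DOTALL)
RESPONSE_RE = re.compile(r'<response\b.*?</response>', re.DOTALL)


def extract_envelopes(log_text):
    """Return (requests, responses) as lists of raw XML strings."""
    return REQUEST_RE.findall(log_text), RESPONSE_RE.findall(log_text)


# Made-up sample with two requests and one response.
sample_log = (
    '12:27:33 INFO Sending BILL request\n'
    '12:27:33 INFO <?xml version="1.0" encoding="UTF-8"?>\n'
    '<request xmlns="XXX">\n'
    '    <numberrequests>1</numberrequests>\n'
    '</request>\n'
    '12:27:34 INFO <?xml version="1.0" encoding="UTF-8"?>\n'
    '<response xmlns="XXX">\n'
    '    <status>OK</status>\n'
    '</response>\n'
    '12:27:35 INFO <?xml version="1.0" encoding="UTF-8"?>\n'
    '<request xmlns="XXX">\n'
    '    <numberrequests>2</numberrequests>\n'
    '</request>\n'
)
req, res = extract_envelopes(sample_log)
print(len(req), len(res))
```

The \b after the tag name keeps the pattern from starting a match at a longer tag name that merely begins with "request" or "response".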

Upvotes: 2
