Reputation: 1736
I have a log that has SOAP request/response entries:
[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,
RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
uid:0de7d51a-abb6-11e4-a436-005056936d96,
===
I want to extract all the XML documents into one big XML file (extract the chunks and wrap them in a root ... tag). I also need the date of each log record.
I want to end up with the following result (I can add the root xmlns attributes by hand):
<Records xmlns="" ...>
<Record datetime="2015-02-03 19:05:13">
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body>
<!-- Other xml data -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
</Record>
...
</Records>
Upvotes: 2
Views: 2048
Reputation: 8402
You can do this with awk.
For example, create a file named awkscript and add the following code:
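BEGIN {
    # Lines that open a log record look like "[2015-02-03 19:05:13] ...".
    date_re = "^\\[[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\\]"
    print "<Records>"
}
$0 ~ date_re {
    # Characters 2-20 of the line hold the full "YYYY-MM-DD HH:MM:SS" timestamp.
    print "\t<Record datetime=\"" substr($0, 2, 19) "\">"
    # Skip to the envelope; stop early at the next record header or at end of input.
    while ((more = (getline)) > 0 && $0 !~ date_re && $0 !~ /^<\/?SOAP-ENV:/) { }
    # Print every consecutive line that starts with a SOAP-ENV tag.
    while (more > 0 && $0 ~ /^<\/?SOAP-ENV:/) { print "\t\t" $0; more = (getline) }
    print "\t</Record>"
}
END { print "</Records>" }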
Run the script against your log file in a shell:
awk -f path_to_awkscript path_to_xml_file > path_to_new_file
Example
Running the script on a log file with the following data:
[2015-02-03 19:05:13] TIME:03.02.2015 19:05:13,
RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
uid:0de7d51a-abb6-11e4-a436-005056936d96,
===
[2014-11-03 19:05:13] TIME:03.02.2015 19:05:13,
RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
uid:0de7d51a-abb6-11e4-a436-005056936d96,
===
[2014-12-15 19:05:13] TIME:03.02.2015 19:05:13,
RAW_REQUEST:<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
</SOAP-ENV:Body></SOAP-ENV:Envelope>
,
uid:0de7d51a-abb6-11e4-a436-005056936d96,
===
Results
<Records>
    <Record datetime="2015-02-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    <Record datetime="2014-11-03 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
    <Record datetime="2014-12-15 19:05:13">
        <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns1="pay_parent" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns2="providers"><SOAP-ENV:Body><!-- ... -->
        </SOAP-ENV:Body></SOAP-ENV:Envelope>
    </Record>
</Records>
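One way to confirm that the combined file is well-formed is to parse it and read back the datetime attributes. The following is only a minimal sketch using Python's standard xml.etree.ElementTree module; the function name record_datetimes is hypothetical, and it assumes the root element is <Records> with no default namespace (if you add xmlns="..." by hand, the tag lookup would need the namespace as well).

```python
import xml.etree.ElementTree as ET


def record_datetimes(xml_text):
    """Parse the combined document and return each Record's datetime attribute.

    ET.fromstring raises ET.ParseError if the text is not well-formed XML.
    Assumes <Record> elements carry no namespace (hypothetical layout).
    """
    root = ET.fromstring(xml_text)
    return [record.get('datetime') for record in root.iter('Record')]
```

For example, after running the awk command, something like record_datetimes(open('path_to_new_file').read()) should return one timestamp per log record, or raise a parse error if a chunk was cut off.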
Upvotes: 1
Reputation: 1736
I could not find a solution using Linux console tools such as grep or sed, so I wrote a Python script.
import sys
import re


def write_xml_log(out_path, lines):
    """
    Joins xml chunks into one document.
    """
    with open(out_path, 'w') as out_fh:
        out_fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out_fh.write('<LogRecords>\n')
        out_fh.writelines((
            '<LogRecord>\n{}\n</LogRecord>\n'.format(line) for line in lines))
        out_fh.write('</LogRecords>')


def prepare_xml_chunks(log_path):
    """
    Prepares xml-chunks: yields each envelope with the date of its log
    record added as a datetime attribute on the Envelope element.
    """
    # Raw strings keep the regex backslashes from being read as string escapes.
    record_date_re = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]')
    envelope_start_re = re.compile(r'(<(?:[\w_-]+:)?Envelope)(.*)$')
    envelope_end_re = re.compile(r'(.*</(?:[\w_-]+:)?Envelope>)')
    envelope_complete_re = re.compile(
        r'(<(?:[\w_-]+:)?Envelope)(.*?>.*?</(?:[\w_-]+:)?Envelope>)')
    record_date = ''
    record_envelope = ''
    state_in_envelope = False
    with open(log_path) as log_fh:
        for line in log_fh:
            match_date = record_date_re.match(line)
            match_envelope_start = envelope_start_re.match(line)
            match_envelope_end = envelope_end_re.match(line)
            match_envelope_complete = envelope_complete_re.match(line)
            # Remember the date of the current log record.
            if match_date:
                record_date = match_date.group(1)
            if not state_in_envelope:
                # One-line envelope.
                if match_envelope_complete:
                    state_in_envelope = False
                    record_envelope = ''
                    yield '{} datetime="{}" {}\n'.format(
                        match_envelope_complete.group(1),
                        record_date,
                        match_envelope_complete.group(2))
                # Multi-line envelope start.
                elif match_envelope_start:
                    state_in_envelope = True
                    record_envelope = '{} datetime="{}" {}\n'.format(
                        match_envelope_start.group(1),
                        record_date,
                        match_envelope_start.group(2))
                # Problem situation.
                elif match_envelope_end:
                    raise Exception('Envelope close tag without open tag.')
            else:
                # Multi-line envelope continue.
                if not match_envelope_end:
                    record_envelope += line
                # Multi-line envelope end.
                else:
                    record_envelope += match_envelope_end.group(1)
                    yield '{}\n'.format(record_envelope)
                    record_envelope = ''
                    state_in_envelope = False


# Usage: python script.py path_to_log path_to_output_xml
write_xml_log(sys.argv[2], prepare_xml_chunks(sys.argv[1]))
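The script above puts the datetime attribute on the Envelope element itself. If the layout from the question is preferred (a Record wrapper that carries the datetime), a shorter variant might look like the sketch below. It is an illustration, not a tested replacement: the function name log_to_records is hypothetical, it reads the whole log into memory, and it assumes that records are separated by a line containing only === and that each record holds at most one envelope.

```python
import re

# Timestamp at the start of a record header, e.g. "[2015-02-03 19:05:13] ...".
DATE_RE = re.compile(r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]', re.MULTILINE)
# One Envelope element with any (or no) namespace prefix, possibly multi-line.
ENVELOPE_RE = re.compile(
    r'<(?:[\w-]+:)?Envelope\b.*?</(?:[\w-]+:)?Envelope>', re.DOTALL)
# Assumed record separator: a line that contains only "===".
SEPARATOR_RE = re.compile(r'^===[ \t]*$', re.MULTILINE)


def log_to_records(log_text):
    """Wrap each record's envelope in <Record datetime="..."> under <Records>."""
    parts = ['<Records>']
    for chunk in SEPARATOR_RE.split(log_text):
        stamp = DATE_RE.search(chunk)
        envelope = ENVELOPE_RE.search(chunk)
        # Records without a timestamp or without an envelope are skipped.
        if stamp and envelope:
            parts.append('<Record datetime="{}">\n{}\n</Record>'.format(
                stamp.group(1), envelope.group(0)))
    parts.append('</Records>')
    return '\n'.join(parts)
```

Splitting on the separator first keeps a timestamp from being paired with an envelope that belongs to a later record, which a single regex over the whole file could do when a record has no envelope.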
Upvotes: 0