How can I split a concatenated xml file and name the extracted files using strings

Question

How do I split a large concatenated XML file into individual XML files, with each output file named using a string found inside it?

input.xml

I want to read the strings file="xxxx-yyyyyyyy.XML" and create output files named as xxxx.XML

output xml files:

1001.xml

1002.xml

1008.xml

My preference is to use bash shell tools such as cat, awk and sed, and/or XML tools such as xmllint or similar, and to log stdout and stderr to a logfile.

I'd appreciate approaches and testable solutions.

RomanPerekhrest · Accepted Answer

Consider the following gawk approach (if your input is constructed as in the question, line by line):

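# gawk only: the 3-argument match() stores the capture groups in array a
awk '/<\?xml/{ xml=$0; getline dt; getline typedoc;
         if (match(typedoc,/file="([0-9]+)-[^"]+.XML"/,a)) {
             fn=a[1]".xml"; print xml ORS dt ORS typedoc > fn; next;
     }}{ print > fn }
' input.xml 2> err.log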

Results:

cat 1001.xml

cat 1002.xml

cat 1008.xml

/<\?xml/ - on encountering a line with the xml declaration

getline dt; - capture the next line into dt

getline typedoc; - capture the next line, which holds the starting type-of-doc tag

if (match(typedoc,/file="([0-9]+)-[^"]+.XML"/,a)) - match the file attribute value;
the 1st captured group ([0-9]+) will be assigned to the 1st array element a[1]
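
For awk implementations without the gawk-specific 3-argument match(), the same steps can be sketched with RSTART/RLENGTH and substr(). This is only a sketch under assumptions: the sample input below is hypothetical (the element names and DOCTYPE are invented; only the file="NNNN-....XML" shape comes from the question), and the log name split.log is my own choice. It also sends both stdout and stderr to one logfile, as the question asked.

```shell
#!/bin/sh
# Hypothetical sample: three header lines per document, as the answer assumes
# (xml declaration, one more line, then the tag carrying file="...").
cat > input.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc file="1001-20170101.XML">
<body>first</body>
</doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc file="1002-20170102.XML">
<body>second</body>
</doc>
EOF

awk '
/^<\?xml / {
    decl = $0                      # the xml declaration starts a new document
    getline dt                     # next line, kept verbatim
    getline typedoc                # line expected to carry file="..."
    if (fn != "") close(fn)        # finish the previous output file
    fn = ""
    if (match(typedoc, /file="[0-9]+-[^"]+\.XML"/)) {
        # drop the 6 leading characters of file=" from the matched text
        id = substr(typedoc, RSTART + 6, RLENGTH - 6)
        sub(/-.*/, "", id)         # keep only the digits before the first "-"
        fn = id ".xml"
        print "writing " fn        # progress message on stdout
        print decl > fn
        print dt > fn
        print typedoc > fn
    } else {
        # no usable name: report on stderr and skip this document
        print "skipping document at line " NR ": no file attribute" > "/dev/stderr"
    }
    next
}
fn != "" { print > fn }            # body lines go to the current output file
' input.xml > split.log 2>&1       # stdout and stderr both end up in split.log

cat split.log
```

With this sample, the run should leave 1001.xml and 1002.xml in the current directory. Unlike the gawk one-liner, a document without a matching file attribute is skipped and reported in the log rather than appended to the previous output file; that is a design choice of this sketch, not part of the original answer.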
