Reputation: 236
How do I split a large concatenated XML file into individual XML files, with each file named from a string inside it?
input.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
I want to read the attribute file="xxxx-yyyyyyyy.XML" from each document
and create an output file named xxxx.xml
output xml files:
1001.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
1002.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
1008.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
My preference is to use bash shell tools such as cat, awk, and sed, and/or XML tools such as xmllint or similar, and to log stdout and stderr to a logfile.
I would appreciate approaches and testable solutions.
Upvotes: 1
Views: 240
Reputation: 92854
Consider the following gawk approach (if your input is constructed as in the question, line by line):
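gawk '/^<\?xml version/{ getline dt; getline typedoc;      # header: XML declaration, DOCTYPE, root tag
    if (match(typedoc, /file="([0-9]+)-[^"]+\.XML"/, a)) {   # 3-argument match() is a gawk extension
        fn = a[1] ".xml"; print $0 ORS dt ORS typedoc > fn; next
    }}
fn { print > fn }   # remaining lines go to the current output file
' input.xml > split.log 2>&1   # log both stdout and stderr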
Results:
cat 1001.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
cat 1002.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
cat 1008.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>
the pattern before the first { - runs the block on each line holding the XML declaration (<?xml version ...)
getline dt; - reads the next line, the <!DOCTYPE declaration, into dt
getline typedoc; - reads the next line, the opening type-of-doc tag, into typedoc
if (match(typedoc, ..., a)) - matches the file attribute value; the 1st captured group ([0-9]+) is assigned to the 1st array element, a[1], which supplies the output file name
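If gawk is not available, the same idea can be sketched in POSIX awk, which lacks the three-argument match() but sets RSTART and RLENGTH. This is only a minimal sketch under some assumptions: the outdir variable, the split-demo directory, the split.log name and the two-document sample are made up for illustration, and each output file is closed when the next document starts, so a repeated numeric prefix would overwrite the earlier file.

```shell
#!/bin/sh
# Hypothetical demo directory and sample input (not from the question).
outdir=split-demo
mkdir -p "$outdir"

cat > "$outdir/input.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
EOF

awk -v outdir="$outdir" '
/^<\?xml version/ {
    # A new document starts: close the previous file and buffer the header.
    if (fn != "") close(fn)
    fn = ""; header = $0
    next
}
fn == "" {
    # Keep buffering until a line carries file="digits-
    header = header ORS $0
    if (match($0, /file="[0-9]+-/)) {
        # Skip the 6 characters of file=" and drop the trailing hyphen.
        fn = outdir "/" substr($0, RSTART + 6, RLENGTH - 7) ".xml"
        print header > fn
        print "wrote " fn    # progress message on stdout, ends up in the log
    }
    next
}
{ print > fn }               # body lines of the current document
' "$outdir/input.xml" > "$outdir/split.log" 2>&1
```

Because the header is buffered until the file attribute is found, this variant does not depend on the attribute sitting exactly two lines below the XML declaration. Both stdout and stderr end up in split.log, as the question asks.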
Upvotes: 1