Gabe

Reputation: 236

How can I split a concatenated xml file and name the extracted files using strings

How do I split a large concatenated XML file into individual XML files, with each file named using a string found in its content?

input.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>

I want to read each file="xxxx-yyyyyyyy.XML" attribute and create an output file named xxxx.xml

output xml files:

1001.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>

1002.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>

1008.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>

My preference is to use bash shell tools such as cat, awk, sed and/or XML tools such as xmllint or similar, and to log stdout and stderr to a logfile.

I would appreciate approaches and testable solutions.

Upvotes: 1

Views: 240

Answers (1)

RomanPerekhrest

Reputation: 92854

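Consider the following gawk approach. It needs GNU awk, because it uses the three-argument form of match(), and it assumes the input is laid out as in the question, with the XML declaration, the DOCTYPE and the opening type-of-doc tag on three consecutive lines: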

gawk '/^<\?xml version/{ getline dt; getline typedoc;
     if (match(typedoc,/file="([0-9]+)-[^"]+\.XML"/,a)) {
         fn=a[1]".xml"; print $0 ORS dt ORS typedoc > fn; next;
     }}{ print > fn }
' input.xml 2> err.log
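
The command above sends only stderr to err.log, while the question asks for stdout and stderr in one logfile. A minimal sketch of that variant follows. It is not the command above: it swaps gawk's three-argument match() for the two-argument form plus substr(), so it should also run under other POSIX awks. The names sample.xml and split.log, the two-document sample and the "wrote ..." progress messages are assumptions made for illustration.

```shell
# Two-document sample standing in for the real input (sample.xml is a made-up name).
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>
EOF

awk '
/^<\?xml version/ {
    decl = $0
    getline dt; getline typedoc
    # Two-argument match() sets RSTART and RLENGTH in any POSIX awk.
    if (match(typedoc, /file="[0-9]+-/)) {
        # Skip the 6 characters of file=" and drop the trailing hyphen.
        fn = substr(typedoc, RSTART + 6, RLENGTH - 7) ".xml"
        print decl > fn; print dt > fn; print typedoc > fn
        print "wrote " fn                  # progress message goes to stdout
    } else {
        fn = ""                            # do not append to a stale file
        print "line " NR ": no file attribute" > "/dev/stderr"
    }
    next
}
fn != "" { print > fn }
' sample.xml > split.log 2>&1
```

Here `> split.log 2>&1` redirects stdout first and then points stderr at the same file, so the progress messages and any awk errors end up together. Run against the real input.xml, it should produce the same per-document files as the original command.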

Results:

cat 1001.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1001-20170101.XML" date="20170101">
</type-of-doc>

cat 1002.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1002-20170101.XML" date="20170101">
</type-of-doc>

cat 1008.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE type-of-doc SYSTEM "file.dtd" [ ]>
<type-of-doc lang="EN" dtd-version="v1" file="1008-20170101.XML" date="20170101">
</type-of-doc>

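  • the leading pattern - runs the first block on every line that holds an XML declaration (<?xml version ...)

  • getline dt; - reads the next line, the <!DOCTYPE declaration, into dt

  • getline typedoc; - reads the line after that, the opening type-of-doc tag, into typedoc

  • match(typedoc, regex, a) - gawk's three-argument match() looks for the file attribute value in the opening tag and stores the captured groups in the array a

  • the 1st captured group ([0-9]+) is stored in the array element a[1], which becomes the base name of the output file

  • { print > fn } - every other line is appended to the output file that is currently open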

Upvotes: 1
