bbedward

Reputation: 6478

Splitting multiple XML elements from a single file into multiple files

I have a file that looks something like this.

a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>1</moreelements></element>

a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>1234</moreelements></element>

a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>12354</moreelements></element>

a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>12534</moreelements></element>

a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>12634</moreelements></element>

The real file contains a large number of similar blocks, say 1000 or more.

I want to read the file and extract every <element> into its own file.

So from this one file I want to create multiple files, each containing text like:

<element><moreelements>1</moreelements></element>

I'd prefer to keep the XML declaration <?xml version="1.0" encoding="UTF-8" standalone="yes"?> but it's not a requirement.

So if <element>....</element> is repeated 1000 times in one file, I want to turn it into 1000 files.

I'm sure there's a way to do this with Unix utilities like awk or sed, but I'm not sure how to accomplish it.

Thanks

Upvotes: 0

Views: 250

Answers (2)

RomanPerekhrest

Reputation: 92854

Alternative gawk approach:

awk '$0~/<element>/{match($0, /<element>.+<\/element>/); out = "el_" (++c) ".xml";
     print substr($0,RSTART,RLENGTH) > out; close(out)}' file

head el_*
==> el_1.xml <==
<element><moreelements>1</moreelements></element>

==> el_2.xml <==
<element><moreelements>1234</moreelements></element>

==> el_3.xml <==
<element><moreelements>12354</moreelements></element>

==> el_4.xml <==
<element><moreelements>12534</moreelements></element>

==> el_5.xml <==
<element><moreelements>12634</moreelements></element>

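$0~/<element>/ - considers only lines that contain an <element> tag

match($0, /<element>.+<\/element>/) - matches the whole <element>...</element> span on the line and sets RSTART and RLENGTH, which substr() then uses to extract it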
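The question also says it would be nice to keep the XML declaration. A minimal sketch of the same match()/substr() idea that writes a fixed declaration line ahead of each element follows. The sample file name input.txt, the shell variable decl, and the el_N.xml naming are assumptions made for illustration; the declaration is re-emitted from a fixed string, not copied from the input.

```shell
# Build a small sample input (input.txt is a hypothetical file name).
decl='<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
printf 'a\nB abc\n<stuff></stuff>%s<element><moreelements>1</moreelements></element>\n' "$decl" > input.txt
printf 'a\nB abc\n<stuff></stuff>%s<element><moreelements>1234</moreelements></element>\n' "$decl" >> input.txt

# For every line holding an <element>...</element> span, write the
# declaration and then the element to a new numbered file.
awk -v decl="$decl" '
match($0, /<element>.*<\/element>/) {
    out = "el_" (++c) ".xml"
    print decl > out                           # first write creates/truncates the file
    print substr($0, RSTART, RLENGTH) > out    # later writes go to the same open file
    close(out)                                 # release the descriptor before the next file
}' input.txt
```

Each output file then has the declaration on line 1 and the element on line 2. Like the one-liner above, this assumes at most one element per line and that an element never spans lines.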

Upvotes: 1

karakfa

Reputation: 67467

a gawk hack...

$ tag="element>"; awk -v RS="</?$tag" -v t="$tag" '
       !(NR%2){f = "element_" (++c) ".xml"; print "<"t $0 "</"t > f; close(f)}' file

$ head element_*

==> element_1.xml <==
<element><moreelements>1</moreelements></element>

==> element_2.xml <==
<element><moreelements>1234</moreelements></element>

==> element_3.xml <==
<element><moreelements>12354</moreelements></element>

==> element_4.xml <==
<element><moreelements>12534</moreelements></element>

==> element_5.xml <==
<element><moreelements>12634</moreelements></element>
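A regular-expression record separator is not guaranteed by POSIX awk (an RS longer than one character is unspecified there), so the hack above depends on gawk or another awk that supports it. As a rough portable alternative, here is a sketch that scans each line with index() and so also copes with several elements on one line. The file name multi.txt and the variables open_tag and close_tag are hypothetical names chosen for this example.

```shell
# Sample input with two elements on one line (multi.txt is a hypothetical name).
printf 'x<element><m>1</m></element>y<element><m>2</m></element>\nno match here\n<element><m>3</m></element>\n' > multi.txt

awk -v open_tag='<element>' -v close_tag='</element>' '
{
    rest = $0
    # Repeatedly cut the first open_tag...close_tag pair out of the line.
    while ((s = index(rest, open_tag)) > 0) {
        rest = substr(rest, s)                # drop everything before the opening tag
        e = index(rest, close_tag)
        if (e == 0) break                     # no closing tag on this line: give up on it
        len = e + length(close_tag) - 1       # length of the whole element
        out = "element_" (++c) ".xml"
        print substr(rest, 1, len) > out
        close(out)                            # keep the number of open files at one
        rest = substr(rest, len + 1)          # continue after this element
    }
}' multi.txt
```

This writes element_1.xml through element_3.xml for the sample input. It does not handle an element that spans several lines or elements of the same name nested inside each other; for input like that, a real XML parser is the safer choice.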

Upvotes: 2
