Reputation: 6478
I have a file that looks something like this:
a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>1</moreelements></element>
a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>1234</moreelements></element>
a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>12354</moreelements></element>
a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>12534</moreelements></element>
a
B abc
c abc
d abc
e abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>12634</moreelements></element>
The same pattern repeats a large number of times, say 1000+.
I want to read the file and extract every <element> into its own file.
So from the one input file I want to create multiple files, each containing text like:
<element><moreelements>1</moreelements></element>
I'd prefer to keep the XML declaration <?xml version="1.0" encoding="UTF-8" standalone="yes"?>, but it's not a requirement.
So if <element>....</element> is repeated 1000 times in one file, I want to turn it into 1000 files.
I'm sure there's a way to do this with Unix utilities like awk or sed, but I'm not sure how to accomplish it.
Thanks
Upvotes: 0
Views: 250
Reputation: 92854
Alternative gawk approach:
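awk '$0~/<element>/{match($0, /<element>.+<\/element>/);
  out = "el_" (++c) ".xml";          # parenthesized so every awk parses the file name the same way
  print substr($0,RSTART,RLENGTH) > out;
  close(out)}' file                  # closing each file avoids running out of file descriptors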
head el_*
==> el_1.xml <==
<element><moreelements>1</moreelements></element>
==> el_2.xml <==
<element><moreelements>1234</moreelements></element>
==> el_3.xml <==
<element><moreelements>12354</moreelements></element>
==> el_4.xml <==
<element><moreelements>12534</moreelements></element>
==> el_5.xml <==
<element><moreelements>12634</moreelements></element>
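$0~/<element>/ - considers only lines that contain an <element> tag
match($0, /<element>.+<\/element>/) - matches the entire <element>...</element> span and sets RSTART and RLENGTH, which substr() then uses to extract it (because .+ is greedy, this assumes one element per line)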
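Since the question mentions a preference for keeping the XML declaration, here is a minimal sketch of a variant that widens the match so it starts at <?xml. The file name decl_sample.txt, its contents, and the decl_ output prefix are assumptions made up for illustration, and the pattern assumes the declaration sits directly before <element> on the same line, as in the question's sample.

```shell
# Hypothetical sample input with the same shape as the file in the question.
cat > decl_sample.txt <<'EOF'
a
B abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>1</moreelements></element>
a
B abc
<stuff></stuff><?xml version="1.0" encoding="UTF-8" standalone="yes"?><element><moreelements>1234</moreelements></element>
EOF

# Match from the XML declaration through the closing tag, so each output
# file starts with the declaration. [?] is a literal question mark.
awk 'match($0, /<[?]xml[^>]*[?]><element>.+<\/element>/) {
    out = "decl_" (++c) ".xml"               # hypothetical output name
    print substr($0, RSTART, RLENGTH) > out  # write only the matched span
    close(out)                               # release the file before the next one
}' decl_sample.txt

head decl_*.xml
```

Each decl_N.xml should then hold the declaration followed by one element on a single line. Lines without a declaration in front of the element are skipped by this pattern, so it is a complement to the command above rather than a drop-in replacement.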
Upvotes: 1
Reputation: 67467
A gawk hack (it relies on a regular-expression record separator, which POSIX awk does not guarantee):
$ tag="element>"; awk -v RS="</?$tag" -v t="$tag" '
!(NR%2){out = "element_" (++c) ".xml"; print "<"t $0 "</"t > out; close(out)}' file
$ head element_*
==> element_1.xml <==
<element><moreelements>1</moreelements></element>
==> element_2.xml <==
<element><moreelements>1234</moreelements></element>
==> element_3.xml <==
<element><moreelements>12354</moreelements></element>
==> element_4.xml <==
<element><moreelements>12534</moreelements></element>
==> element_5.xml <==
<element><moreelements>12634</moreelements></element>
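To see why the NR%2 test selects the element bodies, it may help to print the record numbers. The sketch below uses a made-up two-element file (rs_demo.txt and rs_demo.out are hypothetical names) and assumes an awk that treats a multi-character RS as a regular expression, as gawk does.

```shell
# Hypothetical two-element input, reduced from the shape shown in the question.
printf '%s\n' \
  'a' \
  '<stuff></stuff><element><moreelements>1</moreelements></element>' \
  'a' \
  '<stuff></stuff><element><moreelements>2</moreelements></element>' > rs_demo.txt

# Every opening or closing tag ends a record, so the text between the
# tags always lands in an even-numbered record.
awk -v RS='</?element>' '!(NR % 2) { print NR ": " $0 }' rs_demo.txt > rs_demo.out
cat rs_demo.out
```

With gawk this should print records 2 and 4, each holding one <moreelements> body, while the odd-numbered records carry the text between elements. The trick assumes the tag has no attributes, is not nested, and does not appear anywhere else in the file.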
Upvotes: 2