Reputation: 780
I have an XML file that is composed of numerous XML records delimited by tags like:
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273127</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273128</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273129</PMID>
...
...
...
</PubmedArticle>
I can generate individual files using
awk '/<PubmedArticle/{f=1; out="mysmallfile_"(++c)".xml"} f{print > out} /<\/PubmedArticle>/{close(out); f=0}' mylargefile
see https://stackoverflow.com/a/56892175/6876770
How can I generate files composed of a specific number of records each? For example, supposing I had a large XML file with 1000 XML records and I wanted to create 2 x 500-record XML files?
I'm thinking that awk should save to one file until it reaches the defined number of tag matches, and then start saving to the next.
Upvotes: 0
Views: 464
Reputation: 204488
The difference between your problem and the one in the question you referenced is that their input contained blocks they did not want to appear in the output, while you want every line of input to appear in an output file. They had to print only what was between the start and end tags, and so had to test for both; you don't have that problem, and only need to test for the start OR end tag to determine when to change output files.
With any awk:
$ awk -v maxRecs=2 '
    /<PubmedArticle>/ && ((++recNr % maxRecs) == 1) {
        close(out); out="mysmallfile_" (++fileNr) ".xml"
    }
    { print > out }
' file
$ head mysmallfile*
==> mysmallfile_1.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273127</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
==> mysmallfile_2.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273129</PMID>
...
...
...
</PubmedArticle>
or with GNU awk for multi-char RS and RT:
$ awk -v maxRecs=2 -v RS='</PubmedArticle>\n' -v ORS= '
    (NR % maxRecs) == 1 {
        close(out); out="mysmallfile_" (++fileNr) ".xml"
    }
    RT { print $0 RT > out }
' file
$ head mysmallfile*
==> mysmallfile_1.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273127</PMID>
...
...
...
</PubmedArticle>
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
==> mysmallfile_2.xml <==
<PubmedArticle>
<MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">24273129</PMID>
...
...
...
</PubmedArticle>
I'm calling close() to close every output file as we go, to avoid a possible "too many open files" error from most awks, or a slowdown from gawk.
The above assumes your goal is to create as many files of maxRecs length as possible and then put whatever is left of the input into the last file. So with maxRecs=500, an input file of 800 records would give you one output file of 500 records and another of 300, as opposed to two output files of 400 each.
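If you want to confirm how the records were distributed, a quick sanity check (using grep; not part of the solution above) is to count the opening tags in each output file:

```shell
# Count how many <PubmedArticle> records landed in each output file
grep -c '<PubmedArticle>' mysmallfile_*.xml
```

With multiple files, grep prints one filename:count line per file, so you can see at a glance which file got the leftover records.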
Upvotes: 2
Reputation: 36735
This part
/<PubmedArticle/{f=1; out="mysmallfile_"(++c)".xml"}
might be modified to produce the same out value n times by harnessing integer division; for example, for n=3 it becomes
/<PubmedArticle/{f=1; out="mysmallfile_"int(c++/3)".xml"}
This will give for subsequent lines matching <PubmedArticle
mysmallfile_0.xml
mysmallfile_0.xml
mysmallfile_0.xml
mysmallfile_1.xml
mysmallfile_1.xml
mysmallfile_1.xml
mysmallfile_2.xml
mysmallfile_2.xml
mysmallfile_2.xml
and so on. Note that I used c++ rather than ++c, as using the latter would cause the first filename to repeat only n-1 times, and each subsequent one n times.
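The difference between the two increment styles can be seen directly; this quick demo (not part of the answer) prints the filename index each style produces over six consecutive matches with n=3:

```shell
# c++ yields even groups of n starting at 0: 0 0 0 1 1 1
awk 'BEGIN { for (i = 0; i < 6; i++) printf "%d ", int(c++/3); print "" }'
# ++c makes the first index appear only n-1 times: 0 0 1 1 1 2
awk 'BEGIN { for (i = 0; i < 6; i++) printf "%d ", int(++c/3); print "" }'
```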
This part
/<\/PubmedArticle>/{close(out); f=0}
might be adapted using the remainder of division as follows
/<\/PubmedArticle>/&&c%3==0{close(out); f=0}
The additional condition holds true when c is evenly divisible by 3, which is only the case on the last use of a given out.
(tested in GNU Awk 5.0.1)
Upvotes: 1