Reputation: 1834
I have a 7GB XML document in the TREC format. The file contains DOC elements, each of which has a DOCNO and a TEXT child.
<FILE>
<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>def</DOCNO>
<TEXT>content
of second
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>
</FILE>
I want to filter this document and keep only the DOC elements whose DOCNO appears in a file containing a list of ids:
abc
ghi
So the output becomes
<FILE>
<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>
</FILE>
My guess is that xml_grep should be useful, but I couldn't get it to work.
Upvotes: 1
Views: 281
Reputation: 246992
Using awk to create the xpath and xmlstarlet to filter the document:
$ xpath=$(awk '
BEGIN {printf "//DOC[not("}
{printf "%sDOCNO=\"%s\"", sep, $0; sep=" or "}
END {print ")]"}
' ids.txt)
$ echo "$xpath"
//DOC[not(DOCNO="abc" or DOCNO="ghi")]
$ xmlstarlet ed -O -d "$xpath" file.xml
<FILE>
<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>
</FILE>
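As far as I know, xmlstarlet ed parses the whole document before editing it, which may be a concern for a 7GB file. A rough line-oriented sketch that holds only one DOC at a time is shown here. It is not a real XML parser: it assumes that <DOC>, </DOC> and <DOCNO>...</DOCNO> each sit on their own line, as in the sample, and that ids.txt is non-empty with one id per line. The function name filter_docs is hypothetical.

```shell
#!/bin/sh
# Sketch only: a line-oriented filter, not an XML parser.
# Assumes <DOC>, </DOC> and <DOCNO>...</DOCNO> are each alone on a line
# and that the id list is non-empty with one id per line.
filter_docs() {
    # $1 = file with the ids to keep, $2 = TREC file
    awk '
        # First file: remember every id.
        NR == FNR { keep[$0] = 1; next }
        # Start of a DOC: begin buffering it.
        /^<DOC>$/ { buf = $0; in_doc = 1; wanted = 0; next }
        in_doc {
            buf = buf "\n" $0
            if ($0 ~ /^<DOCNO>.*<\/DOCNO>$/) {
                # Strip the 7-character opening and 8-character closing tag.
                id = substr($0, 8, length($0) - 15)
                if (id in keep) wanted = 1
            }
            # End of the DOC: print the buffer only if its id was listed.
            if ($0 == "</DOC>") { if (wanted) print buf; in_doc = 0 }
            next
        }
        # Lines outside any DOC (such as <FILE>) pass through unchanged.
        { print }
    ' "$1" "$2"
}
```

It could be run as filter_docs ids.txt file.xml > filtered.xml. If the real file does not follow the one-tag-per-line layout, a proper XML tool is the safer choice.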
Upvotes: 2
Reputation: 36272
If you have xml_grep, I assume the Perl module XML::Twig is installed as well. I don't know how xml_grep works, but you could achieve the same result with a complete script, like:
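#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;

XML::Twig->new(
    # Copy everything outside the DOC elements (such as <FILE>) to the output.
    twig_print_outside_roots => 1,
    twig_roots => {
        'DOC' => sub {
            # Skip a DOC that has no DOCNO ("next" is not valid inside a handler).
            my $docno = $_->next_elt('DOCNO') or return;
            # Print the DOC only when its id is one of the wanted ones.
            if ( $docno->text_only =~ m/\A(?:abc|ghi)\Z/ ) {
                $_->print;
            }
            # Release what has been parsed so far to keep memory use low.
            $_[0]->purge;
        },
    },
    pretty_print => 'indented',
)->parsefile( shift );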
It visits every <DOC> element, finds its DOCNO and extracts the text, which is compared to abc or ghi using a regular expression, and prints that partial tree only when there is a match.
Run it like:
perl script.pl xmlfile
That yields (any whitespace differences are not significant, since they fall between elements rather than inside their text):
<FILE>
<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>
</FILE>
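The ids are hard-coded in the regular expression of the script. A minimal sketch for building the same alternation from the id file follows. It assumes one id per line and no regex metacharacters in the ids; the function name ids_to_alternation is hypothetical.

```shell
#!/bin/sh
# Sketch: join the lines of an id file with "|" to form a regex alternation.
# Assumes one id per line and ids without regex metacharacters.
ids_to_alternation() {
    # $1 = file with one id per line
    paste -s -d '|' "$1"
}
```

The resulting string could then replace the abc|ghi part of the pattern. With a very long id list, a hash lookup inside the handler would probably scale better than one large regular expression.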
Upvotes: 3