mossaab

Reputation: 1834

Filter an XML document based on a list of ids

I have a 7GB XML document in the TREC format. The file consists of DOC elements, each containing a DOCNO and a TEXT element.

<FILE>
<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>def</DOCNO>
<TEXT>content
of second
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>
</FILE>

I want to filter this document and keep only the DOCs whose DOCNO appears in a separate file containing a list of ids:

abc
ghi

So the output becomes

<FILE>
<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>
</FILE>

My guess is that xml_grep should be useful, but I couldn't get it to work.

Upvotes: 1

Views: 281

Answers (2)

glenn jackman

Reputation: 246992

Use awk to build the XPath expression and xmlstarlet to filter the document:

$ xpath=$(awk '
            BEGIN {printf "//DOC[not("} 
            {printf "%sDOCNO=\"%s\"", sep, $0; sep=" or "}
            END {print ")]"}
        ' ids.txt)

$ echo "$xpath"
//DOC[not(DOCNO="abc" or DOCNO="ghi")]

$ xmlstarlet ed -O -d "$xpath" file.xml
<FILE>
  <DOC>
    <DOCNO>abc</DOCNO>
    <TEXT>content
of first
doc</TEXT>
  </DOC>
  <DOC>
    <DOCNO>ghi</DOCNO>
    <TEXT>content
of third
doc</TEXT>
  </DOC>
</FILE>
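
If ids.txt may contain blank lines or Windows line endings, the awk step can be made a little more defensive. This is only a sketch: the function name build_xpath is made up for illustration, and it assumes the ids never contain a double quote, since each id is embedded in a "..." literal.

```shell
# build_xpath IDS_FILE   (hypothetical helper, not part of xmlstarlet)
# Prints an XPath that selects every DOC whose DOCNO is NOT listed in IDS_FILE.
# Assumption: ids contain no double quotes.
build_xpath() {
    awk '
        { sub(/\r$/, "") }     # tolerate Windows line endings
        $0 == "" { next }      # skip blank lines
        { cond = cond sep "DOCNO=\"" $0 "\""; sep = " or " }
        END {
            if (cond == "") print "//DOC"    # no ids listed: select every DOC
            else            print "//DOC[not(" cond ")]"
        }
    ' "$1"
}
```

It would then be used the same way as before: xmlstarlet ed -O -d "$(build_xpath ids.txt)" file.xml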

Upvotes: 2

Birei

Reputation: 36272

If you have xml_grep, I assume the XML::Twig module is installed as well. I don't know how xml_grep works, but you can achieve the same result with a complete script, like:

#!/usr/bin/env perl

use warnings;
use strict;
use XML::Twig;

XML::Twig->new(
    twig_print_outside_roots => 1,
    twig_roots => {
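        'DOC' => sub {
            my ( $twig, $doc ) = @_;
            # Handlers are plain subs, so "next" is not valid here; test instead.
            my $docno = $doc->next_elt('DOCNO');
            # Print the DOC only if its DOCNO is exactly one of the wanted ids.
            if ( $docno && $docno->text_only =~ m/\A(?:abc|ghi)\z/ ) {
                $doc->print;
            }
            # Release the DOC just handled so memory stays flat on a large file.
            $twig->purge;
        },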
    },  
    pretty_print => 'indented',
)->parsefile( shift );

It visits every <DOC> element, finds the DOCNO element inside it and extracts its text, compares that text to abc or ghi using a regular expression, and prints the DOC subtree only when it matches.

Run it like:

perl script.pl xmlfile

That yields (note that the extra blank lines are not meaningful, since they fall between the DOC elements):

<FILE>

  <DOC>
    <DOCNO>abc</DOCNO>
    <TEXT>content
of first
doc</TEXT>
  </DOC>


  <DOC>
    <DOCNO>ghi</DOCNO>
    <TEXT>content
of third
doc</TEXT>
  </DOC>
</FILE>
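
The regular expression above hard-codes abc and ghi. For an id list kept in a file, the same match-and-print idea can be sketched with awk instead of XML::Twig. Note that this is a different technique: it is plain line matching, not XML parsing, and it assumes the layout shown in the question, where <DOC>, </DOC> and <DOCNO>...</DOCNO> each sit on a line of their own. The function name filter_docs is made up for illustration.

```shell
# filter_docs IDS_FILE XML_FILE   (hypothetical helper)
# Prints XML_FILE, keeping only the <DOC> blocks whose DOCNO is in IDS_FILE.
# Assumption: <DOC>, </DOC> and <DOCNO>...</DOCNO> are each on their own line.
filter_docs() {
    awk '
        # First file: remember every non-blank id.
        FILENAME == ARGV[1] { if ($0 != "") keep[$0] = 1; next }
        # Start of a DOC: begin buffering its lines.
        /^<DOC>$/ { buf = $0; id = ""; in_doc = 1; next }
        in_doc {
            buf = buf "\n" $0
            if ($0 ~ /^<DOCNO>.*<\/DOCNO>$/)
                id = substr($0, 8, length($0) - 15)   # strip both tags
            if ($0 == "</DOC>") {
                if (id in keep) print buf             # emit only wanted DOCs
                in_doc = 0
            }
            next
        }
        # Anything outside a DOC (such as <FILE> and </FILE>) passes through.
        { print }
    ' "$1" "$2"
}
```

Run it like filter_docs ids.txt file.xml > filtered.xml. Only one DOC is held in memory at a time, which matters for a 7GB input.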

Upvotes: 3
