Filter an XML document based on a list of ids

Question

I have a 7GB XML document in the TREC format. The file contains DOC elements, each of which has a DOCNO and a TEXT child:



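<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>def</DOCNO>
<TEXT>content
of second
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>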

I want to filter this document and keep only the DOCs whose DOCNO appears in a file containing a list of ids:

abc
ghi

So the output becomes



<DOC>
<DOCNO>abc</DOCNO>
<TEXT>content
of first
doc</TEXT>
</DOC>
<DOC>
<DOCNO>ghi</DOCNO>
<TEXT>content
of third
doc</TEXT>
</DOC>

My guess is that xml_grep should be useful, but I couldn't get it to work.

Birei · Accepted Answer

If you have xml_grep, I assume the Perl module XML::Twig is installed as well. I don't know how xml_grep works, but you can achieve the same result with a complete script, like:

#!/usr/bin/env perl

use warnings;
use strict;
use XML::Twig;

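XML::Twig->new(
    # Copy everything outside the DOC elements straight to the output.
    twig_print_outside_roots => 1,
    twig_roots => {
        'DOC' => sub {
            # Skip a DOC without a DOCNO ('return', not 'next': this is a sub, not a loop).
            my $docno = $_->next_elt('DOCNO') or return;
            # Print the whole DOC only when its DOCNO is exactly abc or ghi.
            if ( $docno->text_only =~ m/\A(?:abc|ghi)\z/ ) {
                $_->print;
            }
        },
    },
    pretty_print => 'indented',
)->parsefile( shift );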

It searches for every DOC element, reads the next DOCNO element and extracts its text, compares that text to abc or ghi using a regular expression, and prints that partial tree only when it matches.

Run it like:

perl script.pl xmlfile

That yields (note that the extra whitespace is not meaningful because it lies outside any element):



  
  <DOC>
    <DOCNO>abc</DOCNO>
    <TEXT>content
of first
doc</TEXT>
  </DOC>


  <DOC>
    <DOCNO>ghi</DOCNO>
    <TEXT>content
of third
doc</TEXT>
  </DOC>
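
The script above hard-codes abc and ghi in the regular expression, while the question asks for ids that live in a separate file. A minimal sketch of that variant follows, assuming the id file holds one id per line and that DOCNO is a direct child of DOC. The sub names load_ids and filter_docs are made up for illustration, and the purge call is my own addition to keep memory flat on a file this large; I have not run it against a 7GB input.

```perl
#!/usr/bin/env perl

use warnings;
use strict;
use XML::Twig;

# Read one id per line into a hash so each lookup is a constant-time check.
# (Assumption: one id per line; blank lines are ignored.)
sub load_ids {
    my ($ids_file) = @_;
    open my $fh, '<', $ids_file or die "Cannot open $ids_file: $!\n";
    my %keep;
    while ( my $id = <$fh> ) {
        $id =~ s/\A\s+|\s+\z//g;    # trim surrounding whitespace and the newline
        $keep{$id} = 1 if length $id;
    }
    close $fh;
    return \%keep;
}

# Print every DOC whose DOCNO is a key of %$keep, plus everything that
# lies outside the DOC elements.
sub filter_docs {
    my ( $xml_file, $keep ) = @_;
    XML::Twig->new(
        twig_print_outside_roots => 1,
        twig_roots               => {
            'DOC' => sub {
                my ( $twig, $doc ) = @_;
                # Assumption: DOCNO is a direct child of DOC.
                my $docno = $doc->first_child('DOCNO');
                if ($docno) {
                    my $id = $docno->trimmed_text;
                    $doc->print if $keep->{$id};
                }
                # Discard what has been parsed so far, whether printed or not.
                $twig->purge;
            },
        },
    )->parsefile($xml_file);
    return;
}

# Run only when both file names are given on the command line.
if ( @ARGV == 2 ) {
    my ( $ids_file, $xml_file ) = @ARGV;
    filter_docs( $xml_file, load_ids($ids_file) );
}
else {
    warn "usage: $0 ids_file xml_file\n";
}
```

Run it like this (script.pl and ids.txt are placeholder names), redirecting the output to a new file:

perl script.pl ids.txt xmlfile > filtered.xml

A hash lookup is used instead of building one big alternation regex, so the cost per DOC stays the same however long the id list gets.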
