Reputation: 1994

Delete a SPECIFIC duplicate line from XML file in place

I've been reading about deleting duplicate lines all over Stack. There are perl, awk, and sed solutions, but none is as specific as I want, and I'm at a loss.

I want to delete the duplicate <upath> tags from this XML, case INSENSITIVELY, with a quick bash/shell or perl command. Leave all other duplicate lines (like <start> and <end>) intact!

Input XML:

  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep 
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">                 
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   <------ Duplicate line to REMOVE
    </userinterface>
  </package>

So far I've been able to grab the duplicate lines, but don't know how to remove them. The following

grep -H path *.[Xx][Mm][Ll] | sort | uniq -id

Gives the result:

test.xml:          <upath>/example/dir/here</upath>

How do I remove that line now?

Doing the perl version or awk version below erases the <start> and <end> dates as well.

perl -i.bak -ne 'print unless $seen{lc($_)}++' test.xml
awk '!a[tolower($0)]++' test.xml > test.xml.new

Upvotes: 3

Answers (5)

dlamblin

Reputation: 45351

📎It looks like you're working with XML. Would you like to parse it?

Hey, I've never done it with Perl before, but there's an Introductory Tutorial and everything... which wasn't super straightforward. After reading the XML::SAX::ParserFactory and XML::SAX::Base documentation, I came up with the code you see at the bottom of this answer.

The question was updated to not have adjacent lines; previously:

Okay, I'm seeing that you've got two <start> tags with dates that match and two <end> tags with dates that match in the whole file, but those are in different sections. If all your duplicate lines are also adjacent, as they were in your original example, you need only use the uniq command from GNU Coreutils or an equivalent, whose -i option ignores case. I also looked into ignoring case through the LC_COLLATE environment variable, but honestly, I found it very hard to spot an example or read how to use LC_COLLATE for that.
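
A minimal sketch of the adjacency limitation described above, assuming a uniq that supports -i (--ignore-case), as GNU uniq does and as the question's own pipeline already uses. The function names are made up for illustration:

```shell
#!/bin/sh
# Case-insensitive duplicate directly follows the original:
# uniq -i keeps only the first line of the run.
adjacent() {
  printf '%s\n' \
    '<upath>/Example/Dir/Here</upath>' \
    '<upath>/example/dir/here</upath>' \
    '<upath>/Example/Dir/Here2</upath>' | uniq -i
}

# Same three lines, but the duplicate is no longer adjacent:
# uniq -i leaves all of them in place.
separated() {
  printf '%s\n' \
    '<upath>/Example/Dir/Here</upath>' \
    '<upath>/Example/Dir/Here2</upath>' \
    '<upath>/example/dir/here</upath>' | uniq -i
}

adjacent
separated
```

The second call is the situation in the updated question, which is why uniq alone is not enough there.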

Continuing with a parser:

#!/usr/bin/perl
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser(
    Handler => TestXMLDeduplication->new()
);

my $ret_ref = $parser->parse_file(\*TestXMLDeduplication::DATA);
close(TestXMLDeduplication::DATA);

print "\n\nDuplicates skipped: ", $ret_ref->{skipped}, "\n";
print "Duplicates cut: ", $ret_ref->{cut}, "\n";

package TestXMLDeduplication;
use base qw(XML::SAX::Base);

my $inUserinterface;
my $inUpath;
my $upathSeen;
my $defaultOut;
my $currentOut;
my $buffer;
my %seen;
my %ret;

sub new {
    # Ideally STDOUT would be an argument
    my $type = shift;
    #open $defaultOut, '>&', STDOUT or die "Opening STDOUT failed: $!";
    $defaultOut = *STDOUT;
    $currentOut = $defaultOut;
    return bless {}, $type;
}

sub start_document {
    %ret = ();
    $inUserinterface = 0;
    $inUpath = 0;
    $upathSeen = 0;
}

sub end_document {
    return \%ret;
}

sub start_element {
    my ($self, $element) = @_;

    if ('userinterface' eq $element->{Name}) {
      $inUserinterface++;
      %seen = ();
    }
    if ('upath' eq $element->{Name}) {
      $buffer = q{};
      undef $currentOut;
      open($currentOut, '>>', \$buffer) or die "Opening buffer failed: $!";
      $inUpath++;
    }

    print $currentOut '<', $element->{Name};
    print $currentOut attributes($element->{Attributes});
    print $currentOut '>';
}

sub end_element {
    my ($self, $element) = @_;

    print $currentOut '</', $element->{Name};
    print $currentOut '>';

    if ('userinterface' eq $element->{Name}) {
      $inUserinterface--;
    }

    if ('upath' eq $element->{Name}) {
      close($currentOut);
      $currentOut = $defaultOut;
      # Check if what's in upath was seen (lower-cased)
      if ($inUserinterface && $inUpath) {
        if (!exists $seen{lc($buffer)}) {
          print $currentOut $buffer;
        } else {
          $ret{skipped}++;
          $ret{cut} .= $buffer;
        }
        $seen{lc($buffer)} = 1;
      }
      $inUpath--;
    }
}

sub characters {
    # Note that this also captures indentation and newlines between tags etc.
    my ($self, $characters) = @_;

    print $currentOut $characters->{Data};
}

sub attributes {
    # Returns the serialized attributes as a string; the caller prints it.
    my ($attributesRef) = @_;
    my $out = q{};

    foreach my $attr (values %$attributesRef) {
        my $v = $attr->{Value};
        # See also XML::Quote
        $v =~ s/&/&amp;/g;
        $v =~ s/</&lt;/g;
        $v =~ s/>/&gt;/g;
        $v =~ s/"/&quot;/g;
        $out .= ' ' . $attr->{Name} . '="' . $v . '"';
    }
    return $out;
}

__DATA__
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>   
        <end>2017-04-20</end>    
      </model>
      <model type="B">                 
        <start>2016-04-20</start>     
        <end>2017-04-20</end>        
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   
    </userinterface>
    <userinterface>
      <upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/<b>here</b></upath>   
    </userinterface>
  </package>

This no longer works line by line. Instead it finds upath elements inside userinterface elements and removes those that are case-insensitive duplicates within the same parent. The surrounding indentation and newlines are retained, which is why blank lines remain where duplicates were cut. Also, it would get kind of weird if there were upath tags within upath tags.

It looks like this:

$ perl saxEG.pl
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>

    </userinterface>
    <userinterface>
      <upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>

    </userinterface>
  </package>
Duplicates skipped: 2
Duplicates cut: <upath>/example/dir/here</upath><upath>/example/dir/<b>here</b></upath>

Upvotes: 0

Sobrique

Reputation: 53498

If you're parsing XML, you really should use a parser. There are multiple options for this - but DON'T use regular expressions, because they're a route to really brittle code - for all the reasons you're finding.

See: parsing XML with regex.

But the long and short of it is - XML is a contextual language. Regular expressions aren't. There are also perfectly valid variations in XML formatting that are semantically identical but that a regex won't handle.

E.g. self-closing (unary) tags, variable indentation, tags at different paths in the document, and line wrapping.

I could format your source XML a bunch of different ways - all of which would be valid XML saying the same thing, but which would break regex-based parsing. That's something to be avoided - one day, mysteriously, your script will break for no apparent reason, as the result of an upstream change that's valid within the XML spec.
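
A small sketch of that failure mode, assuming a line-oriented awk filter of the kind discussed in this thread (dedupe_upath_lines is a hypothetical helper name). The same two elements are fed in twice, once with one element per line and once on a single line:

```shell
#!/bin/sh
# Line-based, case-insensitive dedupe limited to lines containing <upath>.
dedupe_upath_lines() {
  awk '!(/<upath>/ && seen[tolower($0)]++)'
}

# Layout 1: one <upath> per line. The second line is dropped as a duplicate.
one_per_line() {
  printf '%s\n' \
    '<upath>/Example/Dir/Here</upath>' \
    '<upath>/example/dir/here</upath>' | dedupe_upath_lines
}

# Layout 2: the same two elements on one line, which is equally valid XML.
# The filter sees one never-before-seen line, so the duplicate survives.
same_line() {
  printf '%s\n' \
    '<upath>/Example/Dir/Here</upath><upath>/example/dir/here</upath>' | dedupe_upath_lines
}

one_per_line
same_line
```

Both inputs describe the same elements, but only the first layout gets deduplicated.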

Which is why you should use a parser:

I like XML::Twig, which is a perl module. You can do what you want with something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig; 

my %seen; 

#a subroutine to process any "upath" tags. 
sub process_upath {
   my ( $twig, $upath ) = @_; 
   my $text = lc $upath -> trimmed_text;
   $upath -> delete if $seen{$text}++; 
}

#instantiate the parser, and configure what to 'handle'. 
my $twig = XML::Twig -> new ( twig_handlers => { 'upath' => \&process_upath } );
#parse from our data block - but you'd probably use a file handle here. 
$twig -> parse ( \*DATA );
#set output formatting
$twig -> set_pretty_print ( 'indented_a' );
#print to STDOUT.
$twig -> print;

__DATA__
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>   
        <end>2017-04-20</end>    
      </model>
      <model type="B">                 
        <start>2016-04-20</start>     
        <end>2017-04-20</end>        
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   
    </userinterface>
  </package>

This is the long form, to illustrate the concept, and it outputs:

<package>
  <id>1523456789</id>
  <models>
    <model type="A">
      <start>2016-04-20</start>
      <end>2017-04-20</end>
    </model>
    <model type="B">
      <start>2016-04-20</start>
      <end>2017-04-20</end>
    </model>
  </models>
  <userinterface>
    <upath>/Example/Dir/Here</upath>
    <upath>/Example/Dir/Here2</upath>
  </userinterface>
</package>

It can be reduced down considerably though, via the parsefile_inplace method.

Upvotes: 2

Ed Morton

Reputation: 203985

$ awk '!(/<upath>/ && seen[tolower($1)]++)' file
  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
    </userinterface>
  </package>
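
The command above writes to standard output. Since the question asks for an in-place change, here is a hedged sketch that wraps the same awk program with a temporary file. dedupe_in_place is a hypothetical helper name, and mktemp is assumed to be available. Like the one-liner, it keys on the first whitespace-separated field, so it assumes one <upath> per line and no spaces inside the path:

```shell
#!/bin/sh
# Rewrite a file "in place" by filtering into a temp file first.
# dedupe_in_place is a made-up name for illustration.
dedupe_in_place() {
  file=$1
  tmp=$(mktemp) || return 1
  # Only overwrite the original if awk succeeded.
  if awk '!(/<upath>/ && seen[tolower($1)]++)' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # writes into the existing file, keeping its permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage would look like `dedupe_in_place test.xml`. With GNU awk 4.1 or later, `gawk -i inplace` with the same program is another option.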

Upvotes: 0

Rany Albeg Wein

Reputation: 3474

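The following script accepts an XML file as its first argument, uses xmlstarlet (invoked as xml in the script) to parse the XML tree, and uses an associative array (requires Bash 4) to store unique <upath> node values. Because Bash does not guarantee the iteration order of associative arrays, the surviving <upath> entries may come out in a different order than in the input.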

#!/bin/bash

input_file=$1
# XPath to retrieve <upath> node value.
xpath_upath_value='//package/userinterface/upath/text()'
# XPath to print XML tree excluding  <userinterface> part.
xpath_exclude_userinterface_tree='//package/*[not(self::userinterface)]'
# Associative array to help us remove duplicated <upath> node values.
declare -A arr

print_userinterface_no_dup() { 
    printf '%s\n' "<userinterface>"
    printf '<upath>%s</upath>\n' "${arr[@]}"
    printf '%s\n' "</userinterface>"
}

# Iterate over each <upath> node value, lower-case it and use it as a key in the associative array.
while read -r upath; do
    key="${upath,,}"
    # We can remove this 'if' statement and simply arr[$key]="$upath"
    # if it doesn't matter whether we remove <upath>foo</upath> or <upath>FOO</upath>
    if [[ ! "${arr[$key]}" ]]; then
        arr[$key]="$upath"
    fi
done < <(xml sel -t -m "$xpath_upath_value" -c \. -n "$input_file")

printf '%s\n' "<package>"

# Print XML tree excluding <userinterface> part.
xml sel -t -m "$xpath_exclude_userinterface_tree" -c \. "$input_file"

# Print <userinterface> tree without duplicates.
print_userinterface_no_dup

printf '%s\n' "</package>"

Test ( script name is sof ):

$ ./sof xml_file
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">                 
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
        <upath>/Example/Dir/Here2</upath>
        <upath>/Example/Dir/Here</upath>
    </userinterface>
</package>

If my comments are not making the code clear enough for you, please ask and I'll answer and edit this solution accordingly.

My xmlstarlet version is 1.6.1, compiled against libxml2 2.9.2 and libxslt 1.1.28.

Upvotes: 2

fejese

Reputation: 4628

If you want to ignore only duplicate lines that directly follow each other, you can store the previous line and compare against it. To ignore case, use tolower() on both sides of the comparison:

awk '{ if (tolower(prev) != tolower($0)) print; prev = $0 }'

Upvotes: 1
