Reputation: 1994
I've been reading about deleting duplicate lines all over Stack Overflow. There are perl, awk, and sed solutions, but none is as specific as what I want, and I'm at a loss.
I want to delete the duplicate <upath> tags from this XML case INSENSITIVELY with a quick bash/shell or perl command. Leave all other duplicate lines (like <start> and <end>) intact!
Input XML:
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start> <------ Duplicate line to keep
<end>2017-04-20</end> <------ Duplicate line to keep
</model>
<model type="B">
<start>2016-04-20</start> <------ Duplicate line to keep
<end>2017-04-20</end> <------ Duplicate line to keep
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
<upath>/example/dir/here</upath> <------ Duplicate line to REMOVE
</userinterface>
</package>
So far I've been able to grab the duplicate lines, but don't know how to remove them. The following
grep -H path *.[Xx][Mm][Ll] | sort | uniq -id
Gives the result:
test.xml: <upath>/example/dir/here</upath>
How do I remove that line now?
Doing the perl version or awk version below erases the <start> and <end> dates as well.
perl -i.bak -ne 'print unless $seen{lc($_)}++' test.xml
awk '!a[tolower($0)]++' test.xml > test.xml.new
Upvotes: 3
Views: 1215
Reputation: 45351
Hey, I've never done it with Perl before, but there's an Introductory Tutorial and everything... which wasn't super straightforward. Reading the XML::SAX::ParserFactory and XML::SAX::Base I came up with the code you see at the bottom of this answer.
Okay, I'm seeing that you've got two <start> tags with dates that match and two <end> tags with dates that match in the whole file, but those are in different sections. If all your duplicate lines are effectively also adjacent, as they were in your example, you need only use the uniq command from GNU Coreutils or an equivalent. This command could perhaps ignore case through the right use of the LC_COLLATE environment variable setting, but honestly, I found it very hard to find an example or documentation of how to use LC_COLLATE for that; the -i option, which your own uniq -id already uses, is the simpler route.
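For the adjacent-duplicates case, here is a minimal sketch using the -i (--ignore-case) option of uniq, the same flag as in the question's uniq -id pipeline. The helper name and the sample lines are made up for illustration. Note that uniq collapses every run of adjacent duplicate lines, not only <upath> ones.

```sh
# Hypothetical helper: drop adjacent duplicate lines, ignoring case.
# Assumes a uniq that supports -i (GNU coreutils does).
dedupe_adjacent() {
    uniq -i
}

# Made-up sample: the second line differs from the first only in case,
# so only the first line of that pair is kept.
printf '%s\n' \
    '<upath>/Example/Dir/Here</upath>' \
    '<upath>/example/dir/here</upath>' \
    '<upath>/Example/Dir/Here2</upath>' |
    dedupe_adjacent
```

If the duplicates are not next to each other, uniq alone will not catch them without sorting first, which would scramble the XML.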
Continuing with a parser:
#!/usr/bin/perl
use XML::SAX;
my $parser = XML::SAX::ParserFactory->parser(
Handler => TestXMLDeduplication->new()
);
my $ret_ref = $parser->parse_file(\*TestXMLDeduplication::DATA);
close(TestXMLDeduplication::DATA);
print "\n\nDuplicates skipped: ", $ret_ref->{skipped}, "\n";
print "Duplicates cut: ", $ret_ref->{cut}, "\n";
package TestXMLDeduplication;
use base qw(XML::SAX::Base);
my $inUserinterface;
my $inUpath;
my $upathSeen;
my $defaultOut;
my $currentOut;
my $buffer;
my %seen;
my %ret;
sub new {
# Ideally STDOUT would be an argument
my $type = shift;
#open $defaultOut, '>&', STDOUT or die "Opening STDOUT failed: $!";
$defaultOut = *STDOUT;
$currentOut = $defaultOut;
return bless {}, $type;
}
sub start_document {
%ret = ();
$inUserinterface = 0;
$inUpath = 0;
$upathSeen = 0;
}
sub end_document {
return \%ret;
}
sub start_element {
my ($self, $element) = @_;
if ('userinterface' eq $element->{Name}) {
$inUserinterface++;
%seen = ();
}
if ('upath' eq $element->{Name}) {
$buffer = q{};
undef $currentOut;
open($currentOut, '>>', \$buffer) or die "Opening buffer failed: $!";
$inUpath++;
}
print $currentOut '<', $element->{Name};
print $currentOut attributes($element->{Attributes});
print $currentOut '>';
}
sub end_element {
my ($self, $element) = @_;
print $currentOut '</', $element->{Name};
print $currentOut '>';
if ('userinterface' eq $element->{Name}) {
$inUserinterface--;
}
if ('upath' eq $element->{Name}) {
close($currentOut);
$currentOut = $defaultOut;
# Check if what's in upath was seen (lower-cased)
if ($inUserinterface && $inUpath) {
if (!exists $seen{lc($buffer)}) {
print $currentOut $buffer;
} else {
$ret{skipped}++;
$ret{cut} .= $buffer;
}
$seen{lc($buffer)} = 1;
}
$inUpath--;
}
}
sub characters {
# Note that this also captures indentation and newlines between tags etc.
my ($self, $characters) = @_;
print $currentOut $characters->{Data};
}
sub attributes {
my ($attributesRef) = @_;
my %attributes = %$attributesRef;
foreach my $a (values %attributes) {
my $v = $a->{Value};
# See also XML::Quote
# Escape XML special characters in attribute values ('&' must come first)
$v =~ s/&/&amp;/g;
$v =~ s/</&lt;/g;
$v =~ s/>/&gt;/g;
$v =~ s/"/&quot;/g;
print $currentOut ' ', $a->{Name}, '="', $v, '"';
}
}
__DATA__
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
<model type="B">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
<upath>/example/dir/here</upath>
</userinterface>
<userinterface>
<upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>
<upath>/example/dir/<b>here</b></upath>
</userinterface>
</package>
This doesn't work by lines any longer and instead finds upath tags inside userinterface tags, which it removes if they're duplicates within that parent group. The surrounding indentation and newlines are retained. Also, it would get kind of weird if there were upath tags within upath tags.
It looks like this:
$ perl saxEG.pl
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
<model type="B">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
</userinterface>
<userinterface>
<upath>/Example/Dir/<b>Here</b></upath> <upath>/Example/Dir/Here2</upath>
</userinterface>
</package>
Duplicates skipped: 2
Duplicates cut: <upath>/example/dir/here</upath><upath>/example/dir/<b>here</b></upath>
Upvotes: 0
Reputation: 53498
If you're parsing XML, you really should use a parser. There are multiple options for this, but DON'T use regular expressions: they're a route to really brittle code, for all the reasons you're finding.
See: parsing XML with regex.
The long and short of it is that XML is a contextual language and regular expressions aren't. There are also some perfectly valid variations in XML that are semantically identical but that a regex won't handle, e.g. unary (self-closing) tags, variable indentation, paths to tags in different locations, and line wrapping.
I could format your source XML a bunch of different ways, all of which would be valid XML saying the same thing, but which would break regex-based parsing. That's something to be avoided: one day, mysteriously, your script will break for no apparent reason, as the result of an upstream change that's valid within the XML spec.
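To make that concrete, here is a small sketch with made-up input: the same two <upath> elements, once on separate lines and once on a single line. The line_dedupe helper is a hypothetical line-based awk filter of the kind usually suggested for this. It catches the duplicate in the first layout but not in the second, even though both are the same XML.

```sh
# Hypothetical line-based filter: drop a line containing <upath> if its
# first whitespace-separated field was already seen (case-insensitively).
line_dedupe() {
    awk '!(/<upath>/ && seen[tolower($1)]++)'
}

# Layout 1: one element per line. The second line is dropped as a duplicate.
printf '%s\n' '<upath>/a/B</upath>' '<upath>/A/b</upath>' | line_dedupe

# Layout 2: the same two elements on one line. Still valid XML, but the
# duplicate survives because the filter only ever sees a single line.
printf '%s\n' '<upath>/a/B</upath><upath>/A/b</upath>' | line_dedupe
```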
Which is why you should use a parser. I like XML::Twig, which is a perl module. You can do what you want with something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my %seen;
#a subroutine to process any "upath" tags.
sub process_upath {
my ( $twig, $upath ) = @_;
my $text = lc $upath -> trimmed_text;
$upath -> delete if $seen{$text}++;
}
#instantiate the parser, and configure what to 'handle'.
my $twig = XML::Twig -> new ( twig_handlers => { 'upath' => \&process_upath } );
#parse from our data block - but you'd probably use a file handle here.
$twig -> parse ( \*DATA );
#set output formatting
$twig -> set_pretty_print ( 'indented_a' );
#print to STDOUT.
$twig -> print;
__DATA__
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
<model type="B">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
<upath>/example/dir/here</upath>
</userinterface>
</package>
This is the long form, to illustrate the concept, and it outputs:
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
<model type="B">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
</userinterface>
</package>
It can be reduced down considerably though, via the parsefile_inplace method.
Upvotes: 2
Reputation: 203985
$ awk '!(/<upath>/ && seen[tolower($1)]++)' file
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start> <------ Duplicate line to keep
<end>2017-04-20</end> <------ Duplicate line to keep
</model>
<model type="B">
<start>2016-04-20</start> <------ Duplicate line to keep
<end>2017-04-20</end> <------ Duplicate line to keep
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here</upath>
<upath>/Example/Dir/Here2</upath>
</userinterface>
</package>
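A longer, commented sketch of the same filter may make the logic easier to follow. The function name dedupe_upath is only illustrative. One assumption to be aware of: the key is awk's first whitespace-separated field, so trailing annotations on the line are ignored, but a path that itself contains spaces would be compared only up to its first space.

```sh
# Hypothetical wrapper around the filter; reads the named files, or
# standard input when no file is given.
dedupe_upath() {
    awk '
        /<upath>/ {
            key = tolower($1)        # first field of the line, lower-cased
            if (seen[key]++) next    # seen before: skip this line
        }
        { print }                    # every other line passes through
    ' "$@"
}
```

Usage would be along the lines of dedupe_upath test.xml > test.xml.new, after which you can inspect the new file and move it into place.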
Upvotes: 0
Reputation: 3474
The following script accepts an XML file as its first argument, uses xmlstarlet (xml in the script) to parse the XML tree, and uses an associative array (requires Bash 4) to store unique <upath> node values.
#!/bin/bash
input_file=$1
# XPath to retrieve <upath> node value.
xpath_upath_value='//package/userinterface/upath/text()'
# XPath to print XML tree excluding <userinterface> part.
xpath_exclude_userinterface_tree='//package/*[not(self::userinterface)]'
# Associative array to help us remove duplicated <upath> node values.
declare -A arr
print_userinterface_no_dup() {
printf '%s\n' "<userinterface>"
printf '<upath>%s</upath>\n' "${arr[@]}"
printf '%s\n' "</userinterface>"
}
# Iterate over each <upath> node value, lower-case it and use it as a key in the associative array.
while read -r upath; do
key="${upath,,}"
# We can remove this 'if' statement and simply arr[$key]="$upath"
# if it doesn't matter whether we remove <upath>foo</upath> or <upath>FOO</upath>
if [[ ! "${arr[$key]}" ]]; then
arr[$key]="$upath"
fi
done < <(xml sel -t -m "$xpath_upath_value" -c \. -n "$input_file")
printf '%s\n' "<package>"
# Print XML tree excluding <userinterface> part.
xml sel -t -m "$xpath_exclude_userinterface_tree" -c \. "$input_file"
# Print <userinterface> tree without duplicates.
print_userinterface_no_dup
printf '%s\n' "</package>"
Test ( script name is sof ):
$ ./sof xml_file
<package>
<id>1523456789</id>
<models>
<model type="A">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
<model type="B">
<start>2016-04-20</start>
<end>2017-04-20</end>
</model>
</models>
<userinterface>
<upath>/Example/Dir/Here2</upath>
<upath>/Example/Dir/Here</upath>
</userinterface>
</package>
If my comments are not making the code clear enough for you, please ask and I'll answer and edit this solution accordingly.
My xmlstarlet version is 1.6.1, compiled against libxml2 2.9.2 and libxslt 1.1.28.
Upvotes: 2
Reputation: 4628
If you want to ignore only duplicate lines right after each other, you can store the previous line and compare against that. To ignore case, use tolower() on both sides of the comparison:
awk '{ if (tolower(prev) != tolower($0)) print; prev = $0 }'
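A small sketch of the adjacent-only comparison, wrapped in a hypothetical function and run on made-up lines. It also shows the limitation: in the question's sample the duplicate is separated from the original by the Here2 line, so an adjacent-only check leaves it in place.

```sh
# Hypothetical helper: print a line unless it equals the previous line,
# ignoring case. NR == 1 makes sure the first line is always printed.
dedupe_adjacent_ci() {
    awk 'NR == 1 || tolower($0) != tolower(prev) { print } { prev = $0 }' "$@"
}

# Adjacent duplicate: the second line is dropped.
printf '%s\n' '<upath>/Dir/Here</upath>' '<upath>/dir/here</upath>' '<upath>/Dir/Here2</upath>' |
    dedupe_adjacent_ci

# Same lines in the question's order (duplicate not adjacent): all three stay.
printf '%s\n' '<upath>/Dir/Here</upath>' '<upath>/Dir/Here2</upath>' '<upath>/dir/here</upath>' |
    dedupe_adjacent_ci
```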
Upvotes: 1