Village

Reputation: 24463

How to delete duplicate lines while ignoring particular characters?

I need to remove all of the duplicate lines from a file, while ignoring every occurrence of these characters:

(),、“”。!?#

As an example, these two lines would be considered duplicates, so one of them would be deleted:

“This is a line。“
This is a line

Similarly, these three lines would be considered duplicates, and only one would remain:

This is another line、 with more words。
“This is another line with more words。”
This is another line! with more words!

How can I delete all of the duplicate lines in a file, while ignoring some characters?

Upvotes: 1

Views: 210

Answers (2)

Ashley

Reputation: 4335

Here is one approach. Collect the lines into arrays keyed on a normalized version of each line, where "normalized" means every character you want to ignore is removed and runs of spaces are squashed into one. For each key, the shortest original line is the one that gets printed. Which variant to keep wasn't really specified, so season that heuristic to taste. The code is a bit terse for production, so you might flesh it out for clarity.

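use utf8;                     # the source and DATA section contain UTF-8 literals
use strictures;               # CPAN module; "use strict; use warnings;" is the core alternative
use open qw/ :std :utf8 /;    # UTF-8 layers for the standard handles and newly opened files

my %tree;                     # normalized line => [ original lines ]
while (my $original = <DATA>) {
    chomp $original;
    # d deletes the listed characters that have no replacement; s squashes runs of spaces.
    ( my $normalized = $original ) =~ tr/ (),、“”。!?#/ /sd;
    push @{$tree{$normalized}}, $original;
    #print "O:",$original, $/;
    #print "N:",$normalized, $/;
}

# Sort each group so that the shortest original comes first.
@{$_} = sort { length $a <=> length $b } @{$_} for values %tree;

# Print the first (shortest) entry of each group.
print $_->[0], $/ for values %tree;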

__DATA__
“This is a line。“
This is a line
This  is   a line
This is another line、 with more words。
This is another line with more words
This is another line! with more words!

Yields (hash order is arbitrary, so the two lines may come out in either order):

This is another line with more words
This is a line
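
If the order of the output matters, the same idea can be sketched in awk so that groups are printed in the order they first appear in the input. This is only a sketch under a few assumptions: keep_shortest is a made-up function name, the ignored characters are the ones listed in the question, and length() may count bytes or characters depending on the awk and the locale, so ties between variants containing multibyte punctuation can resolve differently than in the Perl version.

```shell
# keep_shortest: hypothetical helper, not part of the answer above.
# Groups lines by a normalized key and prints the shortest original
# line of each group, in order of first appearance.
keep_shortest() {
  awk '{
    key = $0
    # Drop the ignored characters. Alternation is used instead of a bracket
    # expression so multibyte characters match as whole byte sequences.
    gsub(/\(|\)|,|、|“|”|。|!|\?|#/, "", key)
    # Squash runs of spaces into a single space.
    gsub(/ +/, " ", key)
    if (!(key in best)) {
      order[++n] = key          # remember first-seen order of keys
      best[key] = $0
    } else if (length($0) < length(best[key])) {
      best[key] = $0            # a shorter variant replaces the current one
    }
  }
  END { for (i = 1; i <= n; i++) print best[order[i]] }' "$@"
}
```

It reads standard input or the files named as arguments, for example: keep_shortest input.txt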

Upvotes: 1

fredtantini

Reputation: 16586

From your example, you could just delete your symbols, and then remove your duplicates.

For instance:

$ cat foo
«This is a line¡»
This is another line! with more words¡

Similarly, these three lines would be considered duplicates, and only one would remain:
This is a line

This is another line, with more words!
This is another line with more words

$ tr --delete '¡!«»,' < foo | awk '!a[$0]++'
This is a line
This is another line with more words

Similarly these three lines would be considered duplicates and only one would remain:

$

Seems to do the job.
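
One caveat: as far as I know, GNU tr works on bytes rather than characters, so deleting multibyte punctuation such as 、 or 。 with tr can also remove bytes that belong to other characters. A variant of the same pipeline using sed with literal substitutions avoids that. This is a sketch only: strip_and_dedupe is a made-up name, and the character list is taken from the question.

```shell
# strip_and_dedupe: hypothetical helper. Like the tr | awk pipeline, it
# prints the cleaned lines, not the original ones.
strip_and_dedupe() {
  # ASCII symbols go in one bracket expression; each multibyte character
  # gets its own literal substitution so it is matched as a whole sequence.
  sed -e 's/[(),!?#]//g' \
      -e 's/、//g' -e 's/“//g' -e 's/”//g' -e 's/。//g' "$@" |
    awk '!seen[$0]++'   # print a line only the first time it is seen
}
```

Usage is the same as before: strip_and_dedupe foo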

Edit:

From your question, it seems that those symbols and punctuation marks do not matter. You should make that explicit in the question.

I don't have time to write it out, but I think the easy way is to parse your file and maintain an array of already-printed lines:

for each line:
  cleanedLine = stripFromSymbol(line)
  if cleanedLine not in AlreadyPrinted:
    AlreadyPrinted.push(cleanedLine)
    print line
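
The pseudocode above could be turned into awk roughly like this. It is a minimal sketch with assumptions: dedupe_keep_original is a made-up name, the stripped characters are the ones listed in the question, and the first line of each group is the one that is kept.

```shell
# dedupe_keep_original: hypothetical helper implementing the pseudocode.
# Prints each original line unless its cleaned form was already printed.
dedupe_keep_original() {
  awk '{
    cleaned = $0
    # stripFromSymbol: remove every ignored character from the copy.
    gsub(/\(|\)|,|、|“|”|。|!|\?|#/, "", cleaned)
    if (!(cleaned in printed)) {
      printed[cleaned] = 1      # AlreadyPrinted.push(cleanedLine)
      print                     # print the original, untouched line
    }
  }' "$@"
}
```

An associative array is used instead of a list, so the "not in AlreadyPrinted" check does not have to scan every previous line.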

Upvotes: 1
