Reputation: 24463
I need to remove all duplicate lines from a file, while ignoring every occurrence of these characters:
(),、“”。!?#
As an example, these two lines would be considered duplicates, so one of them would be deleted:
“This is a line。“
This is a line
Similarly, these three lines would be considered duplicates, and only one would remain:
This is another line、 with more words。
“This is another line with more words。”
This is another line! with more words!
How can I delete all of the duplicate lines in a file, while ignoring some characters?
Upvotes: 1
Views: 210
Reputation: 4335
Here is one approach: collect the lines into arrays keyed on a normalized version of each line. Normalizing here means deleting all the characters you want to ignore and squashing runs of spaces. The script then keeps the shortest original line in each group. Which version to keep wasn't really specified, so season to taste. The code is a bit terse for production, so you might flesh it out for clarity.
use utf8;
use strictures;
use open qw/ :std :utf8 /;
my %tree;
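while (my $original = <DATA>) {
    chomp $original;
    # Delete the ignored characters (/d) and squash runs of spaces (/s).
    ( my $normalized = $original ) =~ tr/ (),、“”。!?#/ /sd;
    push @{$tree{$normalized}}, $original;
    #print "O:",$original, $/;
    #print "N:",$normalized, $/;
}
# Sort each group by length so the shortest original comes first.
@{$_} = sort { length $a <=> length $b } @{$_} for values %tree;
# Print one line per group; hash order is not guaranteed.
print $_->[0], $/ for values %tree;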
__DATA__
“This is a line。“
This is a line
This is a line
This is another line、 with more words。
This is another line with more words
This is another line! with more words!
Yields:
This is another line with more words
This is a line
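For readers who don't use Perl, the same group-then-choose idea can be sketched in Python. This is only an illustration, not a translation of the script above: the names IGNORED, normalize and keep_shortest are made up for the example, and it collapses all whitespace (not just spaces) and trims the ends, which the tr/// version does not do. Unlike the hash-based Perl version, the groups come out in first-seen order.

```python
import re

# Characters to ignore when comparing lines (taken from the question).
IGNORED = "(),、“”。!?#"
_DELETE = str.maketrans("", "", IGNORED)


def normalize(line: str) -> str:
    """Delete ignored characters and squash runs of whitespace."""
    return re.sub(r"\s+", " ", line.translate(_DELETE)).strip()


def keep_shortest(lines):
    """Group lines by normalized form and return the shortest of each group."""
    groups = {}
    for line in lines:
        line = line.rstrip("\n")
        groups.setdefault(normalize(line), []).append(line)
    # dicts preserve insertion order, so groups come out in first-seen order;
    # min() returns the first of several equally short lines.
    return [min(group, key=len) for group in groups.values()]
```

Swapping min(group, key=len) for group[0] or max(group, key=len) changes which variant survives, if the shortest one is not what you want.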
Upvotes: 1
Reputation: 16586
From your example, you could just delete your symbols, and then remove your duplicates.
For instance:
$ cat foo
«This is a line¡»
This is another line! with more words¡
Similarly, these three lines would be considered duplicates, and only one would remain:
This is a line
This is another line, with more words!
This is another line with more words
$ tr --delete '¡!«»,' < foo | awk '!a[$0]++'
This is a line
This is another line with more words
Similarly these three lines would be considered duplicates and only one would remain:
$
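Seems to do the job on this sample. One caveat: GNU tr works on bytes rather than characters, so with multibyte UTF-8 symbols such as 、 or 。 it may also damage other characters that happen to share those bytes.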
Edit:
From your question, it seemed like those symbols and punctuation marks did not matter in the output. You should specify that.
I don't have time to write it out, but I think the easy way is to parse your file and maintain a list of the cleaned lines that have already been printed:
for each line:
    cleanedLine = stripFromSymbol(line)
    if cleanedLine not in AlreadyPrinted:
        AlreadyPrinted.push(cleanedLine)
        print line
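The pseudocode above could be fleshed out roughly as follows in Python. Treat it as a sketch under a few assumptions: the ignored characters are the ones listed in the question, runs of whitespace are collapsed so that "line、 with" and "line with" compare equal, and the names IGNORED, strip_symbols and dedupe are invented for this example. A set replaces the array so the membership test stays fast on large files.

```python
import re

# Characters to ignore when comparing lines (taken from the question).
IGNORED = "(),、“”。!?#"
_DELETE = str.maketrans("", "", IGNORED)


def strip_symbols(line: str) -> str:
    """Delete ignored characters and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", line.translate(_DELETE)).strip()


def dedupe(lines):
    """Yield each original line whose cleaned form has not been seen yet."""
    seen = set()
    for line in lines:
        key = strip_symbols(line)
        if key not in seen:
            seen.add(key)
            yield line  # the original line, punctuation intact
```

To run it over a file, something like sys.stdout.writelines(dedupe(open("foo", encoding="utf-8"))) should work, assuming the file is UTF-8. The first occurrence of each line is the one that is kept, and input order is preserved.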
Upvotes: 1