Jake

Reputation: 159

Remove duplicates from list of files in perl

I know this should be pretty simple, and the shell version is something like:

$ sort example.txt | uniq -u

in order to remove duplicate lines from a file. How would I go about doing this in Perl?

Upvotes: 0

Views: 2149

Answers (4)

Jonathan Leffler

Reputation: 753615

The interesting spin on this question is the uniq -u! I don't think the other answers I've seen tackle this; they deal with sort -u example.txt or (somewhat wastefully) sort example.txt | uniq.

The difference is that the -u option eliminates all occurrences of duplicated lines, so the output consists only of lines that appear exactly once.

To tackle this, you need to know how many times each name appears, and then print the names that appear just once. Assuming the list is read from standard input, this code does the trick:

my %counts;
while (<>)
{
    chomp;            # strip the trailing newline
    $counts{$_}++;    # count how many times each line appears
}

foreach my $name (sort keys %counts)
{
    print "$name\n" if $counts{$name} == 1;
}
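
For example (hypothetical input), given a file containing:

alice
bob
alice

only bob is printed, because alice appears more than once.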

Or, using grep:

my %counts;
while (<>)
{
    chomp;
    $counts{$_}++;
}

{
    local $, = "\n";    # separate the printed names with newlines
    print grep { $counts{$_} == 1 } sort keys %counts;
}

Or, if you don't need to remove the newlines (because you're only going to print the names):

my %counts;
$counts{$_}++ for (<>);
print grep { $counts{$_} == 1 } sort keys %counts;

If you do in fact want every name that appears in the input to appear in the output (but only once), then any of the other solutions will do the trick, or will with minimal adaptation. In fact, since the input lines will end with a newline, you can generate the answer in just two lines:

my %counts = map { $_, 1 } <>;
print sort keys %counts;

No, you can't do it in one line by simply replacing %counts in the print line with the map from the first line:

print sort keys map { $_, 1 } <>;

You get the error:

Type of arg 1 to keys must be hash or array (not map iterator) at ...
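
If you really want a single statement, one possible workaround is to build an anonymous hash reference and dereference it in place, at some cost in readability:

print sort keys %{ +{ map { $_ => 1 } <> } };  # +{ } forces a hash ref rather than a block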

Upvotes: 1

user502515

Reputation: 4444

First of all, sort -u xxx.txt would have been smarter than sort | uniq -u.

Second, perl -ne 'print unless $seen{$_}++' is prone to integer overflow, so the more defensive perl -ne 'if(!$seen{$_}){print;$seen{$_}=1}' seems preferable.
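
Either one-liner can be run directly against a file (using example.txt from the question):

$ perl -ne 'print unless $seen{$_}++' example.txt

Note that this keeps the first occurrence of every line, so it behaves like sort -u (minus the sorting) rather than uniq -u, which discards repeated lines entirely.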

Upvotes: 0

ysth

Reputation: 98388

Are you wanting to update a list of files to remove duplicate lines? Or process a list of files, ignoring duplicate lines? Or remove duplicate filenames from a list?

Assuming the latter:

my %seen;
@filenames = grep !$seen{$_}++, @filenames;   # keep only the first occurrence of each name

or other solutions from perldoc -q duplicate
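
If you instead meant the second reading (process the files while ignoring duplicate lines), a minimal sketch along the same lines:

my %seen;
while (<>) {
    next if $seen{$_}++;   # skip any line we have already seen
    print;                 # or do whatever per-line processing you need
}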

Upvotes: 0

snoofkin

Reputation: 8895

Or use the uniq sub from the List::MoreUtils module after reading the whole file into a list (although it's not a good solution, since the entire file ends up in memory).
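
A minimal sketch of that approach (it keeps the first occurrence of each line, and holds the whole file in memory):

use List::MoreUtils qw(uniq);

my @lines = <>;      # slurp the entire input into a list
print uniq @lines;   # print each distinct line once, in original order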

Upvotes: 0
