Jason

Reputation: 247

How to remove duplicate lines with Perl or Bash?

I have a list:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

How is it possible to do this:

If three or more emails in the whole list share the same domain, all of them except the first one need to be removed.

Output:

[email protected]
[email protected]
[email protected]
[email protected]

Upvotes: 1

Views: 1083

Answers (5)

Eran Ben-Natan

Reputation: 2615

If you don't mind the order, just use sort:

sort -t '@' -u -k 2,2 your_file

If you do mind the order, use:

gawk '{print NR "@" $0}' your_file | sort -t '@' -u -k 3,3 | sort -t '@' -k 1,1n | cut -d \@ -f 2-
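As a rough alternative to the decorate/sort/undecorate pipeline, a single awk pass can keep the first line per domain while preserving input order. This is only a sketch: the function name `first_per_domain` and the sample addresses are made up for illustration, and it assumes each line holds exactly one address with a single `@`.

```shell
#!/bin/sh
# Hypothetical helper: print only the first line seen for each domain,
# keeping the original order. Assumes one address per line, one "@" each.
first_per_domain() {
    # -F@ splits on "@", so $2 is the domain. seen[$2]++ evaluates to 0
    # (false) the first time a domain appears, so only that line is printed.
    awk -F@ '!seen[$2]++' "$@"
}

# Made-up sample data (not the addresses from the question).
printf '%s\n' 'a@one.example' 'b@two.example' 'c@one.example' | first_per_domain
```

With the sample input this should print `a@one.example` and `b@two.example`. The function also accepts file names, e.g. `first_per_domain your_file`.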

Upvotes: 0

potong

Reputation: 58578

This might work for you:

sed ':a;$!N;s/^\([^@]*@\([^\n]*\)\)\n.*\2/\1/;ta;P;D' file
[email protected]
[email protected]
[email protected]

Upvotes: 0

Borodin

Reputation: 126762

I am puzzled why your example output contains [email protected] twice but assume it is a mistake.

As long as there are no issues with trailing space characters or more complex forms of email addresses, you can do this simply in Perl with

perl -aF@ -ne 'print unless $seen{$F[1]}++' myfile

Output:

[email protected]
[email protected]
[email protected]

Upvotes: 0

Sinan Ünür

Reputation: 118166

#!/usr/bin/env perl

use strict; use warnings;
use Email::Address;

my %data;

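while (my $line = <DATA>) {
    # Parse the first whitespace-delimited token on the line as an address.
    my ($addr) = Email::Address->parse($line =~ /^(\S+)/);
    next unless $addr;    # skip blank or unparsable lines
    push @{ $data{ $addr->host } }, $addr->original;
}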

for my $addrs (values %data) {
    if (@$addrs > 2) {
        print "$addrs->[0]\n";
    }
    else {
        print "$_\n" for @$addrs;
    }
}

__DATA__
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
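Because `values %data` returns the groups in no particular order, the script above does not necessarily print addresses in input order. For comparison, here is a minimal awk sketch of the same threshold rule that keeps input order. The function name `collapse_frequent_domains`, its optional threshold argument, and the sample addresses are assumptions made for illustration; it also assumes one address per line with a single `@`.

```shell
#!/bin/sh
# Hypothetical helper implementing the literal rule from the question:
# a domain that occurs `min` (default 3) or more times keeps only its first
# address; domains below the threshold are left untouched. Input order is kept.
collapse_frequent_domains() {
    awk -F@ -v min="${1:-3}" '
        # Buffer every line and count how often each domain occurs.
        { line[NR] = $0; dom[NR] = $2; count[$2]++ }
        END {
            for (i = 1; i <= NR; i++) {
                d = dom[i]
                # Skip the second and later lines of a frequent domain.
                if (count[d] >= min && printed[d]++) continue
                print line[i]
            }
        }'
}

# Made-up sample data: x.example occurs three times, y.example twice.
printf '%s\n' 'a@x.example' 'b@x.example' 'c@y.example' 'd@x.example' 'e@y.example' |
    collapse_frequent_domains
```

With the sample input this should print `a@x.example`, `c@y.example` and `e@y.example`. The whole input is buffered in memory, which is fine for a list of addresses but not for very large files.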

Upvotes: 3

alexisdm

Reputation: 29896

sed -s 's/@/@\t/g' test.txt | uniq -f 1 | sed -s 's/@\t/@/g'

The first sed splits each email into two fields (name and domain) with a tab character, so that uniq -f 1 can skip the first field when comparing lines and drop the duplicate domains; the last sed removes the tab again.
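Note that uniq only compares adjacent lines, so the pipeline relies on addresses with the same domain sitting next to each other. If they may be scattered through the file, one option is to sort on the domain field first. The sketch below assumes that reordering the output by domain is acceptable; the function name `uniq_by_domain_sorted` and the sample addresses are hypothetical, and the tab is generated with printf so the sed commands do not depend on `\t` support.

```shell
#!/bin/sh
# Hypothetical variant: sort by domain first so equal domains become adjacent.
# The output is ordered by domain, not by original position.
uniq_by_domain_sorted() {
    tab=$(printf '\t')
    sed "s/@/@${tab}/" "$@" |        # name@<TAB>domain
        sort -s -t "$tab" -k 2,2 |   # stable sort on the domain field
        uniq -f 1 |                  # compare lines ignoring the first field
        sed "s/@${tab}/@/"           # remove the tab again
}

# Made-up sample data with non-adjacent duplicate domains.
printf '%s\n' 'b@two.example' 'a@one.example' 'c@two.example' 'd@one.example' |
    uniq_by_domain_sorted
```

Because the sort is stable (`-s`), the address kept for each domain is the one that appeared first in the input.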

Upvotes: 1
