Reputation: 247
I have a list:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
How can I do the following:
if the whole list contains three or more emails with the same domain, all duplicates except the first one need to be removed.
Output:
[email protected]
[email protected]
[email protected]
[email protected]
Upvotes: 1
Views: 1083
Reputation: 2615
If you don't mind the order, just use sort:
sort -t '@' -u -k 2,2 your_file
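If you do mind the order, number the lines first, deduplicate on the domain, then sort back into the original order and strip the numbers: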
gawk '{print NR "@" $0}' your_file | sort -t '@' -u -k 3,3 | sort -t '@' -k 1,1n | cut -d \@ -f 2-
Upvotes: 0
Reputation: 58578
This might work for you:
sed ':a;$!N;s/^\([^@]*@\([^\n]*\)\)\n.*\2/\1/;ta;P;D' file
[email protected]
[email protected]
[email protected]
Upvotes: 0
Reputation: 126762
I am puzzled why your example output contains [email protected] twice but assume it is a mistake.
As long as there are no issues with trailing space characters or more complex forms of email addresses, you can do this simply in Perl with:
perl -aF@ -ne 'print unless $seen{$F[1]}++' myfile
Output:
[email protected]
[email protected]
[email protected]
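For readers without Perl at hand, the same "keep the first address seen per domain" idea can be sketched in awk. This is only an illustration: the function name `first_per_domain` and the file name in the usage comment are hypothetical, and like the one-liner above it assumes plain `name@domain` lines with no trailing whitespace.

```shell
#!/bin/sh
# Hypothetical helper: print a line only the first time its domain
# (the text after "@") is seen, preserving the input order.
first_per_domain() {
    # seen[$2]++ is 0 (false) on first sight, so !seen[$2]++ is true once.
    awk -F'@' '!seen[$2]++' "$1"
}

# Usage (file name is an assumption):
#   first_per_domain myfile
```

Note that this, like the Perl version, ignores the "three or more" threshold from the question and drops every repeated domain.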
Upvotes: 0
Reputation: 118166
#!/usr/bin/env perl
use strict; use warnings;
use Email::Address;
my %data;
while (my $line = <DATA>) {
    my ($addr) = Email::Address->parse($line =~ /^(\S+)/);
    push @{ $data{ $addr->host } }, $addr->original;
}
for my $addrs (values %data) {
    if (@$addrs > 2) {
        print "$addrs->[0]\n";
    }
    else {
        print "$_\n" for @$addrs;
    }
}
__DATA__
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
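The script above applies the "three or more" rule, but it iterates over `values %data`, so the output order is not the input order. If the order matters, one possible variant is a two-pass awk sketch: the first pass counts addresses per domain, the second prints. The function name `dedupe_domains` and the file name in the usage comment are hypothetical, and the sketch assumes plain `name@domain` lines stored in a regular file (it is read twice, so it cannot come from a pipe).

```shell
#!/bin/sh
# Hypothetical helper: domains that occur fewer than three times are left
# alone; for domains that occur three or more times only the first address
# is kept. Input order is preserved.
dedupe_domains() {
    awk -F'@' '
        # Pass 1 (NR == FNR only while reading the first file argument):
        # count how many addresses each domain has.
        NR == FNR { count[$2]++; next }
        # Pass 2: print if the domain is below the threshold, or if this
        # is the first time the domain is seen.
        count[$2] < 3 || !seen[$2]++
    ' "$1" "$1"
}

# Usage (file name is an assumption):
#   dedupe_domains addresses.txt
```

With this rule a domain that appears exactly twice keeps both addresses, which matches the repeated domain in the question's expected output.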
Upvotes: 3
Reputation: 29896
sed -s 's/@/@\t/g' test.txt | uniq -f 1 | sed -s 's/@\t/@/g'
The first sed splits each address into two fields (name and domain) with a tab character so that uniq can skip the first field when removing duplicate domains, and the last sed removes the tab again.
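One thing to keep in mind is that uniq only compares adjacent lines, so this pipeline removes a repeated domain only when the addresses are already next to each other. The sketch below wraps the same idea in a function so it can be tried on grouped and ungrouped input. The function name `dedupe_adjacent_domains` and the file name are hypothetical, and a literal tab from `printf` is used instead of `\t` so the sketch does not depend on how a particular sed treats that escape.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed | uniq | sed pipeline.
dedupe_adjacent_domains() {
    tab=$(printf '\t')
    sed "s/@/@${tab}/" "$1" |   # name@<TAB>domain: two blank-separated fields
        uniq -f 1 |             # compare lines while skipping the first field
        sed "s/@${tab}/@/"      # remove the tab again
}

# Usage (file name is an assumption; group the input by domain first):
#   dedupe_adjacent_domains test.txt
```

If the list is not grouped by domain, sorting on the domain first (for example with `sort -t '@' -k 2,2`) would bring equal domains together, at the cost of losing the original order.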
Upvotes: 1