Print the last occurrences of duplicate line only

Question

I've got stdout from a command for which I'd like to strip duplicates in reverse order.

That is, I'd like the duplicate lines stripped from the beginning not from the end. For example, to strip from the end I might use the classic technique with awk:

awk '!a[$0]++'

While brilliant, it strips the wrong lines:

$ printf 'one\nfour\ntwo\nthree\nfour\n' | awk '!a[$0]++'
one
four
two
three

I'd like the last occurrence of four printed, i.e.:

$ printf 'one\nfour\ntwo\nthree\nfour\n' |

ghoti · Accepted Answer

Using your example to generate input for testing:

printf 'one\nfour\ntwo\nthree\nfour\n'

The easiest way to handle this is simply to reverse your data, twice. The following works in BSD and OS X:

command | tail -r | awk '!a[$0]++' | tail -r

But the -r option isn't universal. If you're on Linux, you can generate the same effect with the tac command (opposite of cat) which is part of coreutils:

command | tac | awk '!a[$0]++' | tac

If neither of these works (e.g. you're on HP-UX or an older Solaris), you may be able to reverse things using sed:

command | sed '1!G;h;$!d' | awk '!a[$0]++' | sed '1!G;h;$!d'
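If you need one script to run on several of these platforms, the three reversal methods above can be folded into a single helper. This is only a sketch: the function names reverse_lines and keep_last are made up for illustration, and the probe for tail -r assumes that an unsupported option makes tail exit non-zero.

```shell
# Hypothetical helper: reverse stdin using whichever tool is available.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac                              # GNU coreutils
    elif tail -r </dev/null >/dev/null 2>&1; then
        tail -r                          # BSD / OS X
    else
        # 1!G  on every line but the first, append the hold space to the pattern space
        # h    copy the pattern space (the lines so far, newest first) to the hold space
        # $!d  on every line but the last, delete the pattern space without printing
        sed '1!G;h;$!d'
    fi
}

# Hypothetical helper: keep only the last occurrence of each line.
keep_last() {
    reverse_lines | awk '!a[$0]++' | reverse_lines
}
```

With those defined, command | keep_last behaves like the pipelines above.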

Of course, you could do this with perl as well:

command | perl -e 'print reverse <>' | awk '!a[$0]++' | perl -e 'print reverse <>'

But if perl is available on your system, you might as well simplify the pipe and skip awk entirely:

command | perl -e '$a{$_}++ or print for reverse <>'

I've never really liked perl, though, and I do like doing things in shell. If you're in bash (version 4 or up), and you don't care much about performance, you can implement an array right in your shell:

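mapfile -t a < <(command)
declare -A b
# Walk backwards, dropping every line that has already been seen further down.
for (( i=${#a[@]}-1; i>=0; i-- )); do (( b[${a[i]}]++ )) && unset 'a[i]'; done
(( ${#a[@]} )) && printf '%s\n' "${a[@]}"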

No external tools required. :-)

UPDATE:

Inspired (or perhaps challenged) by sudo_O's answer, here's one more option that works in pure awk on BSD (i.e. doesn't require GNU awk):

command | awk '{a[NR]=$0;b[$0]=NR} END {for(i=1;i<=NR;i++) if(i==b[a[i]]) print a[i]}'

Note that this stores all input in memory twice, so it may be inappropriate for large datasets.
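If memory is the concern and the data is in a regular file rather than a pipe, a two-pass variant avoids keeping the lines in order: it holds one array entry per distinct line instead of one per input line as well. This is a different technique from the one-liner above, sketched under the assumption that the file can be read twice. The function name keep_last_file is hypothetical.

```shell
# Hypothetical helper: print only the last occurrence of each line in a file.
# Assumes "$1" is a regular file that can be read twice (not a pipe).
keep_last_file() {
    awk '
        # Pass 1: remember the final line number at which each line appears.
        NR == FNR { last[$0] = FNR; next }
        # Pass 2: print a line only when it sits at that final position.
        FNR == last[$0]
    ' "$1" "$1"
}
```

For example, keep_last_file output.txt prints output.txt with the earlier duplicates removed, in the original order.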
