Reputation: 33

Bash - Compare rows then print just original rows

I have files that look like this (there can be more columns or rows):

dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1

I want to compare the numbers in each row (every column after the first) and print only the rows whose numbers have not already appeared, so I get this:

dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1

Upvotes: 0

Answers (4)

dawg

Reputation: 103754

This works with both POSIX awk and GNU awk:

$ awk '{s=""
        for (i=2;i<=NF; i++) 
               s=s $i "|"} 
       s in seen { next }
       ++seen[s]' file

Which can be shortened to:

$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file

This also supports a variable number of columns, since the key is built from every field after the first.
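As a quick check, here is a minimal sketch that wraps the shortened command in a shell function and feeds it the sample rows from the question. The function name dedup_rows is made up for illustration; the awk program is the same idea as above with a longer variable name.

```shell
# Hypothetical helper: print rows whose fields 2..NF have not been seen before.
# Reads the files given as arguments, or standard input if there are none.
dedup_rows() {
    awk '{
        key = ""
        for (i = 2; i <= NF; i++)   # build a key from every field after the first
            key = key $i "|"
    }
    !seen[key]++                    # true only the first time a key appears
    ' "$@"
}

# Sample rows from the question.
dedup_rows <<'EOF'
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
EOF
```

With that input, the last row is dropped because its key matches the first row's.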

If you want a sort and uniq solution that also respects file order (i.e., the first row of each set of duplicates is printed, not a later one), you need a decorate, sort, undecorate approach.

You can:

use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields from field 3 through the end of the line, then numerically on the added line number;
uniq -f2 to skip the line number and name fields and keep only the first row of each group of duplicates (sort -u will not do this here, because the line-number key makes every row unique);
finally use sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//' to remove the added line numbers:

cat -n file | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
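Note that a pipeline like this emits rows in sorted-key order, not in the order they appear in the file. A minimal sketch of one way to get the original order back is to sort on the line number again before stripping it. The function name dsu_dedup is hypothetical, and the sketch assumes that after cat -n the fields are line number, name, then the numbers, which is why uniq skips two fields.

```shell
# Hypothetical helper: decorate, sort, dedupe, restore order, undecorate.
# Reads the files given as arguments, or standard input if there are none.
dsu_dedup() {
    cat -n "$@" |
        sort -k3 -k1,1n |   # group by the number columns, earliest line first
        uniq -f2 |          # skip line number and name, keep first of each group
        sort -k1,1n |       # put the survivors back in original file order
        sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'   # strip the line numbers
}

# Rows deliberately out of sorted order; the third repeats the first.
printf '%s\n' 'a 2 1 1' 'b 1 1 1' 'c 2 1 1' | dsu_dedup
```

Like the original pipeline, this compares the number columns as text, so it assumes consistent spacing between columns.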

Awk is easier and faster in this case.

Upvotes: 1

RavinderSingh13

Reputation: 133458

Try the following awk code too:

awk '!a[$2,$3,$4]++'   Input_file

Explanation: this creates an array named a indexed by $2,$3,$4. The condition !a[$2,$3,$4]++ is true only when the current line's $2,$3,$4 combination is not yet present in a. Two things then happen:

The ++ increments the value stored at that index, so the condition will NOT be true the next time the same $2,$3,$4 combination appears.
No action is specified (awk works as a condition followed by an action), so the default action, printing the current line, is used. This repeats for every line in Input_file; the last line of the sample is not printed because its $2,$3,$4 are already present in a.
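To make the two steps above concrete, here is a rough sketch of the same logic written out longhand. It is meant as an illustration of what the one-liner does, not a replacement for it; the function name first_seen_rows and the variable names key and count are made up for this example.

```shell
# Hypothetical helper: longhand equivalent of  awk '!a[$2,$3,$4]++'
# Reads the files given as arguments, or standard input if there are none.
first_seen_rows() {
    awk '{
        key = $2 SUBSEP $3 SUBSEP $4   # the index awk builds for a[$2,$3,$4]
        count = a[key] + 0             # 0 the first time this triple appears
        a[key] = count + 1             # the ++ part: record one more occurrence
        if (count == 0)                # the ! part: true only on first occurrence
            print                      # the default action, written out
    }' "$@"
}

first_seen_rows <<'EOF'
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
EOF
```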

I hope this helps.

Upvotes: 2

David C. Rankin

Reputation: 84541

Another simple approach is to combine sort and uniq: use a KEYDEF covering fields 2-4 with sort, then skip field 1 with uniq, e.g.

$ sort file.txt -k 2,4 | uniq -f1

Example Use/Output

$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1

Upvotes: 4

jas

Reputation: 10865

Keep a running record of the triples already seen and only print the first time they appear:

$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1

Upvotes: 2
