Retaining one member of a pair

Question

Good afternoon to all,

I have a file containing two fields, each representing a member of a pair. I want to retain one member of each pair and it does not matter which member as these are codes for duplicate samples in a study.

Each pair appears twice in my file, with each member of the pair appearing once in either column.

An example of an input file is:

    XXX1 XXX7
    XXX2 XXX4
    abc2 dcb3
    XXX7 XXX1
    dcb3 abc2
    XXX4 XXX2

And an example of the desired output would be

XXX1
XXX2
abc2

How might this be accomplished in bash? Thank you.

Lars Fischer · Accepted Answer

Here is a combination of GNU awk, sort and cut. Store the script as duplicatePairs.awk:

    { if ( $1 < $2) print $1, $2
      else print $2, $1
    }

and run it like this: awk -f duplicatePairs.awk your_file | sort -u | cut -d" " -f1

The if orders each pair so that a line with x y and a line with y x are printed identically. sort -u can then remove the duplicate lines, and cut selects the first column.
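As a quick check, the three stages can be sketched end to end on the sample input from the question. The temporary file name pairs.txt is only an assumption for illustration, and the awk program is inlined rather than read from duplicatePairs.awk. LC_ALL=C is added so that sort orders uppercase before lowercase regardless of the local collation settings.

```shell
# Hypothetical sample file reproducing the input from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'XXX1 XXX7' 'XXX2 XXX4' 'abc2 dcb3' \
              'XXX7 XXX1' 'dcb3 abc2' 'XXX4 XXX2' > "$tmpdir/pairs.txt"

# Stage 1 (awk): print each pair with the smaller code first.
# Stage 2 (sort -u): collapse the lines that are now identical.
# Stage 3 (cut): keep only the first field of each remaining line.
result=$(awk '{ if ($1 < $2) print $1, $2; else print $2, $1 }' "$tmpdir/pairs.txt" \
         | LC_ALL=C sort -u \
         | cut -d" " -f1)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Note that this variant returns the retained codes in sorted order, not in the order they first appear in the file.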

With a slightly larger awk script, the whole task can be done in awk alone:

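    {
        # Pick the smaller member of the pair as its representative.
        smallest = $1
        if ($1 > $2) {
            smallest = $2
        }

        # Print each representative only the first time it is seen.
        if (!(smallest in seen)) {
            seen[smallest] = 1
            print smallest
        }
    }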

Run it like this: awk -f duplicatePairs.awk your_file
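For one-off use, the same idea might be condensed into a single command. This is a sketch of an equivalent variant, not the script above: it uses the common `!seen[x]++` idiom in place of the explicit `in` test, and the variable name s is arbitrary. The sample input from the question is piped in with printf.

```shell
# Rule 1 picks the smaller member of each pair.
# Rule 2 prints it only if it has not been printed before.
result=$(printf '%s\n' 'XXX1 XXX7' 'XXX2 XXX4' 'abc2 dcb3' \
                       'XXX7 XXX1' 'dcb3 abc2' 'XXX4 XXX2' \
         | awk '{ s = ($1 < $2) ? $1 : $2 } !seen[s]++ { print s }')

printf '%s\n' "$result"
```

Unlike the sort -u pipeline, this keeps the codes in the order in which their pairs first appear in the file, and it does not need the input to be sorted.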
