Sigurgeir

Reputation: 365

Retaining one member of a pair

Good afternoon to all,

I have a file containing two fields, each representing a member of a pair. I want to retain one member of each pair and it does not matter which member as these are codes for duplicate samples in a study.

Each pair appears twice in my file, with each member of the pair appearing once in each column.

An example of an input file is:

XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX7 XXX1
dcb3 abc2
XXX4 XXX2

And an example of the desired output would be

XXX1
XXX2
abc2

How might this be accomplished in bash? Thank you.

Upvotes: 1

Views: 53

Answers (2)

Sigurgeir

Reputation: 365

While the answer posted by Lars works very well, I would like to suggest an alternative, just in case someone stumbles upon this problem.

I had previously used awk '!seen[$2,$1]++ {print $1}' with the same result. I did not realize it had worked because the number of lines in my file was not halved. This turned out to be due to some wrong assumptions I had made about my data.
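One thing worth checking before relying on that one-liner: the key `seen[$2,$1]` is different for a line `a b` and its mirror `b a`, so on its own the test only filters repeated identical lines. Below is a minimal sketch of a variant that builds an order-independent key first. The `key` variable and the inline sample data (taken from the question) are illustrative assumptions, not part of the original command.

```shell
# Feed the sample pairs from the question through awk on stdin.
result=$(printf '%s\n' 'XXX1 XXX7' 'XXX2 XXX4' 'abc2 dcb3' \
                       'XXX7 XXX1' 'dcb3 abc2' 'XXX4 XXX2' \
  | awk '
      # Build the same key for "a b" and "b a" by putting the smaller field first.
      { key = ($1 < $2) ? $1 SUBSEP $2 : $2 SUBSEP $1 }
      # Print the first field only the first time this pair is encountered.
      !seen[key]++ { print $1 }
    ')
printf '%s\n' "$result"
```

This prints the first field of whichever line of a pair comes first in the input, so the output follows the input order.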

Upvotes: 0

Lars Fischer

Reputation: 10149

Here is a combination of GNU awk, sort and cut. Store the script as duplicatePairs.awk:

    { if ( $1 < $2) print $1, $2
      else print $2, $1
    }

and run it like this: awk -f duplicatePairs.awk your_file | sort -u | cut -d" " -f1

The if orders the two fields of each line, so that a line with x y and a line with y x are printed identically. Then sort -u removes the duplicate lines, and cut selects the first column.
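As a quick check, the same pipeline can be run inline against the sample input from the question. This is only a sketch: the temporary file is a stand-in for your_file, and `LC_ALL=C` is an added assumption that pins the sort to byte order so the output order does not depend on the locale.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
XXX1 XXX7
XXX2 XXX4
abc2 dcb3
XXX7 XXX1
dcb3 abc2
XXX4 XXX2
EOF

# 1. awk prints each pair with the smaller field first,
# 2. sort -u drops the now-identical duplicate lines,
# 3. cut keeps only the first column.
result=$(awk '{ if ($1 < $2) print $1, $2; else print $2, $1 }' "$input" \
  | LC_ALL=C sort -u | cut -d" " -f1)
printf '%s\n' "$result"
rm -f "$input"
```

Note that the output comes out in sorted order rather than input order, because sort reorders the lines.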


With a slightly larger awk script, we can meet the requirement with awk alone:

    { 
     smallest = $1;
     if ( $1 > $2) {
        smallest = $2
     }

     if( !(smallest in seen) ) {
        seen [ smallest ] = 1
        print smallest
     }
    }

Run it like this: awk -f duplicatePairs.awk your_file
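For illustration, here is a sketch that writes the awk-only script and the question's sample data into a temporary directory and runs it. The directory and the file name pairs.txt are hypothetical stand-ins. Because the script remembers only the smaller member, it assumes each sample code belongs to exactly one pair.

```shell
# Work in a throwaway directory (hypothetical paths).
workdir=$(mktemp -d)

# The awk-only script from above, saved as duplicatePairs.awk.
cat > "$workdir/duplicatePairs.awk" <<'EOF'
{
  # Pick the smaller of the two fields as the pair's representative.
  smallest = $1
  if ($1 > $2) {
    smallest = $2
  }
  # Print each representative only the first time it is seen.
  if (!(smallest in seen)) {
    seen[smallest] = 1
    print smallest
  }
}
EOF

# Sample input from the question.
printf '%s\n' 'XXX1 XXX7' 'XXX2 XXX4' 'abc2 dcb3' \
              'XXX7 XXX1' 'dcb3 abc2' 'XXX4 XXX2' > "$workdir/pairs.txt"

result=$(awk -f "$workdir/duplicatePairs.awk" "$workdir/pairs.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Unlike the sort-based pipeline, this keeps the representatives in the order in which their pairs first appear in the input.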

Upvotes: 2
