D.Parker

Reputation: 171

removing duplicated strings within a column with shell

I have a file with two columns separated by tabs as follows:

OG0000000   PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

I tried to start this by using awk.

awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt

But my output looks like this, where duplicates remain whenever the duplicated string is the first value on its line.

OG0000000   PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF07690,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I realize the problem is that the record boundaries are wrong: the first value on each line gets fused into one record with the first column (everything from the previous comma up to the next one, including the tab), so it is never recognized as a duplicate. But I'm still rough with awk and couldn't figure out how to fix this without messing up the first column. Thanks in advance!
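For what it's worth, printing the records awk sees (on a hypothetical cut-down sample line) confirms this; the first record carries the whole first column, and the newline is just another character inside a record:

```shell
# With RS="," the first record runs from the start of input to the
# first comma, so it includes the first column and the tab.
printf 'OG0000000\tPF03169,PF03169,\n' |
awk 'BEGIN{RS=","} { printf "record %d: [%s]\n", NR, $0 }'
```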

Upvotes: 11

Views: 288

Answers (6)

anubhava

Reputation: 785276

This awk should work for you:

awk -F '[\t,]' '
{
   printf "%s", $1 "\t"
   for (i=2; i<=NF; ++i) {
      # skip the empty field after the trailing comma, then dedup
      if ($i != "" && !seen[$i]++)
         printf "%s,", $i
   }
   print ""
   delete seen
}' file

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

PS: As per the expected output shown, this solution also keeps a trailing comma on each line.
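The per-line `delete seen` is what scopes the dedup to each line. A minimal sketch on a hypothetical two-line input (the names are made up) shows what happens without it; a value already seen on an earlier line is dropped from later lines:

```shell
# seen[] is never cleared here, so "X" on the second line is
# suppressed because the first line already contained it.
printf 'A\tX,Y,\nB\tX,Z,\n' |
awk -F '[\t,]' '{
   printf "%s\t", $1
   for (i=2; i<=NF; ++i)
      if ($i != "" && !seen[$i]++) printf "%s,", $i
   print ""
}'
```

Here the second line comes out as `B<TAB>Z,` with the `X` missing.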

Upvotes: 11

dawg

Reputation: 103884

Here is a ruby:

ruby -ane 'puts "#{$F[0]}\t#{$F[1].split(",").uniq.join(",")},"' file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

Upvotes: 2

potong

Reputation: 58440

This might work for you (GNU sed):

sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta' file

Iterate over each line, repeatedly deleting the later copy of any duplicated string that appears after the whitespace, until no duplicates remain.
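To unpack the one-liner a little: `:a` is a label, `ta` branches back to it as long as the previous `s///` succeeded, and each successful pass deletes one later copy of a token captured as `\2`. A sanity check on a single sample line (GNU sed assumed):

```shell
# Each pass of the :a loop removes one duplicate "word plus comma"
# token that reappears later in the line.
printf 'OG0000005\tPF00012,PF01061,PF12697,PF00012,\n' |
sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta'
```

which should leave `OG0000005<TAB>PF00012,PF01061,PF12697,`.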

Upvotes: 6

RavinderSingh13

Reputation: 133538

With your shown samples and attempts, please try the following awk code. There is no need to set RS and ORS (the record separator and output record separator) for this requirement; keep the default line-based records, set FS and OFS appropriately, split the second field on commas, and print the fields accordingly.

awk '
BEGIN{ FS=OFS="\t" }
{
  val=""
  delete seen
  num=split($2,arr,",")
  for(i=1;i<=num;i++){
    # ignore the empty element after the trailing comma
    if(arr[i]!="" && !seen[arr[i]]++){
      val=(val?val ",":"") arr[i]
    }
  }
  print $1,val ","
}
' Input_file
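The same split-and-rebuild idea can also be condensed into a one-liner for quick testing (a sketch; the variable names are my own, and the seen array is cleared on every line):

```shell
# Split the tab-separated second field on commas, keep only first
# occurrences, and reattach a comma after each kept value.
printf 'OG0000002\tPF07690,PF00083,PF00083,PF07690,PF00083,\n' |
awk 'BEGIN{ FS=OFS="\t" }
     { out=""; delete seen; n=split($2,a,",")
       for(i=1;i<=n;i++) if(a[i]!="" && !seen[a[i]]++) out=out a[i] ","
       print $1, out }'
```

which prints `OG0000002<TAB>PF07690,PF00083,`.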

Upvotes: 6

David C. Rankin

Reputation: 84559

Another approach, using the same split of $2 into an array, but keeping a separate counter for the position of the non-duplicated values, could be done as:

awk '
  { 
    printf "%s\t", $1
    delete seen
    n = split($2,arr,",")
    pos = 0
    for (i=1;i<=n;i++) { 
      if (! (arr[i] in seen)) { 
        printf "%s%s", pos ? "," : "", arr[i]
        seen[arr[i]]=1
        pos++ 
      }
    }
    print ""
  }
' file.txt

Example Output

With your input in file.txt, the output is:

OG0000000       PF03169,MAC1_004431-T1,
OG0000002       PF07690,PF00083,
OG0000003       MAC1_000127-T1,
OG0000004       PF13246,PF00689,PF00690,
OG0000005       PF00012,PF01061,PF12697,
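A side note on where the trailing comma in this output comes from: split() on a string that ends with a comma produces a final empty element, and printing the `,` separator in front of that empty element is what emits the closing comma. A tiny sketch of that split() behavior:

```shell
# split() counts the empty element after the final comma:
# n is 3 here and arr[3] is the empty string.
awk 'BEGIN{ n = split("a,b,", arr, ","); printf "%d [%s]\n", n, arr[3] }'
```

which prints `3 []`.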

Upvotes: 8

sseLtaH

Reputation: 11227

Using GNU sed

$ sed -E ':a;s/([^ \t]*[ \t]+)?(([[:alnum:]]+,).*)\3/\1\2/;ta' input_file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,
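One caveat worth flagging: `[[:alnum:]]` matches neither `_` nor `-`, so on tokens such as `MAC1_000127-T1` the class can capture just a trailing run like `T1,` and mishandle duplicates of those tokens (the sample only duplicates all-alphanumeric `PF` accessions, so this does not show above). A variant of my own, matching any run of non-blank, non-comma characters instead, should handle them too (GNU sed assumed):

```shell
# Capture whole comma-delimited tokens instead of alnum-only runs.
printf 'OG0000003\tMAC1_000127-T1,MAC1_000127-T1,\n' |
sed -E ':a;s/([^ \t]*[ \t]+)?(([^ \t,]+,).*)\3/\1\2/;ta'
```

which leaves a single `MAC1_000127-T1,`.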

Upvotes: 5
