Matze

Reputation: 368

Applying awk operation to a specific column

I have a file whose lines look like this:

chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117

I now want to edit the last column to remove duplicates, so that it would only be SGIP1;MIR3117.

If I only have the last column, I can use the following awk code to remove the duplicates.

a="SGIP1;SGIP1;SGIP1;SGIP1;MIR3117"
echo "$a" | awk -F";" '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'

This returns SGIP1;MIR3117;

However, I cannot figure out how to apply this to my fifth column only. If I just pipe in the whole line, I get SGIP1 twice, because awk then treats everything before the first semicolon as a single field. Is there an elegant way to do this?
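A minimal reproduction of the problem, assuming the sample line above is the whole input. With `-F";"` the first field is `chr1 66999275 67216822 + SGIP1`, which is a different string from the bare `SGIP1` in the second field, so both survive the duplicate check. The variable names `line` and `result` are only for illustration.

```shell
line='chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117'

# Same awk program as above, but fed the whole line instead of only the last column.
result=$(printf '%s\n' "$line" |
  awk -F";" '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}')

# SGIP1 shows up twice: once inside field 1 and once as field 2.
printf '%s\n' "$result"
```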

Upvotes: 2

Views: 324

Answers (2)

RavinderSingh13

Reputation: 133458

Could you please try the following.

awk '
{
  delete found
  num=split($NF,array,";")
  for(i=1;i<=num;i++){
    if(!found[array[i]]++){
      val=(val?val ";":"")array[i]
    }
  }
  $NF=val
  val=""
}
1
'  Input_file

Explanation: a detailed, commented version of the code above.

awk '                                   ##Start of the awk program.
{
  delete found                          ##Clear the found array so that duplicates are tracked per line, not across the whole file.
  num=split($NF,array,";")              ##Split the last field ($NF) of the current line on ; into array; num is the number of elements.
  for(i=1;i<=num;i++){                  ##Loop from i=1 to the total number of elements in array.
    if(!found[array[i]]++){             ##If this element has not been seen yet on this line, do the following.
      val=(val?val ";":"")array[i]      ##Append the element to val, putting ; in front of it unless val is still empty.
    }
  }
  $NF=val                               ##Assign val to the last field of the current line.
  val=""                                ##Reset val for the next line.
}
1                                       ##1 prints the (possibly edited) line.
' Input_file                            ##Name of the input file.
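Why the per-line reset matters can be checked with a small sketch on two records where the second one repeats a gene from the first. The function name `dedup_last_field` and the second sample record are made up for illustration; only the first record comes from the question.

```shell
# Hypothetical helper: reads records on stdin, writes them with a deduplicated last field.
dedup_last_field() {
  awk '
  {
    delete found                      # forget the keys seen on the previous line
    num = split($NF, array, ";")
    val = ""
    for (i = 1; i <= num; i++)
      if (!found[array[i]]++)         # true only the first time a key appears on this line
        val = (val == "" ? "" : val ";") array[i]
    $NF = val                         # assigning a field rebuilds the record using OFS (one space by default)
  }
  1'
}

# First record is from the question; the second is invented test data.
sample='chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117
chr1 70000000 70100000 - SGIP1;TCTEX1D1;SGIP1'

result=$(printf '%s\n' "$sample" | dedup_last_field)
printf '%s\n' "$result"
```

Without the `delete found` line, SGIP1 would be dropped from the second record entirely, because it was already counted while processing the first one.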

Upvotes: 1

Daemon Painter

Reputation: 3470

I don't consider it "elegant", and it only works under some assumptions: the strand column is always +, and + appears nowhere else on the line.

awk -F"+" '{printf("%s+ ",$1); n=split($2,a,";"); delete c; for(i=1;i<=n;i++){gsub(" ", "", a[i]); if(!c[a[i]]++) printf("%s;", a[i])} print ""}' test.txt

Tested on your input, returns:

chr1 66999275 67216822 + SGIP1;MIR3117;
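If the dependence on "+" is a concern, one possible variation is to pick the column by number instead of splitting the line on the strand sign. This is only a sketch: the function name `dedup_column`, the awk variable `col`, and the minus-strand sample record are all made up for illustration, and the input is assumed to be whitespace-separated.

```shell
# Hypothetical helper: $1 is the 1-based number of the column to deduplicate.
dedup_column() {
  awk -v col="$1" '
  {
    delete seen                       # track duplicates per line only
    n = split($col, parts, ";")
    out = ""
    for (i = 1; i <= n; i++)          # counted loop keeps the original order of the entries
      if (!seen[parts[i]]++)
        out = (out == "" ? "" : out ";") parts[i]
    $col = out                        # no trailing semicolon is produced
    print
  }'
}

plus=$(printf '%s\n' 'chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117' | dedup_column 5)
minus=$(printf '%s\n' 'chr1 100 200 - A;B;A' | dedup_column 5)   # invented record
printf '%s\n%s\n' "$plus" "$minus"
```

Because the column is addressed as `$col`, the strand sign plays no role, so records on the minus strand are handled the same way.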

Upvotes: 0
