Reputation: 368
I have a file whose lines look like this:
chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117
I now want to edit the last column to remove duplicates, so that it would only be SGIP1;MIR3117.
If I only have the last column, I can use the following awk code to remove the duplicates.
a="SGIP1;SGIP1;SGIP1;SGIP1;MIR3117"
echo "$a" | awk -F";" '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
This returns SGIP1;MIR3117;
However, I cannot figure out how to apply this to the fifth column only. If I just pipe in the whole line, I get SGIP1 twice, because awk then treats everything before the first semicolon as one column. Is there an elegant way to do this?
Upvotes: 2
Views: 324
Reputation: 133458
Could you please try the following.
awk '
{
delete found ##Reset the lookup array so duplicates are tracked per line only.
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(!found[array[i]]++){
val=(val?val ";":"")array[i]
}
}
$NF=val
val=""
}
1
' Input_file
Explanation: a detailed explanation of the above code.
awk ' ##Starting awk program from here.
{
delete found ##Clearing the found array so that names seen on earlier lines are not dropped from this line.
num=split($NF,array,";") ##Using the split function of awk to split the last field ($NF) of the current line into an array named array, with ; as the delimiter.
for(i=1;i<=num;i++){ ##Running a loop from i=1 up to the total number of elements in array.
if(!found[array[i]]++){ ##Checking whether this element is NOT yet present in the found array; if so, do the following.
val=(val?val ";":"")array[i] ##Creating variable val and appending the element to it, separated by ; (only elements that satisfy the condition above).
}
}
$NF=val ##Setting the last field of the current line to val.
val="" ##Nullifying variable val before the next line.
}
1 ##1 prints the current line, edited or not.
' Input_file ##Mentioning Input_file name here.
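As a quick check of the same idea on more than one line, here is a minimal sketch. The function name `dedupe_last_field` and the second sample line are made up for illustration and are not from the question. It clears `found` at the start of every line, so a name that already appeared on an earlier line is still kept on later ones. It assumes your awk accepts `delete` on a whole array (gawk, mawk, and BWK awk do).

```shell
# Hypothetical helper: de-duplicate the ";"-separated entries in the last field.
dedupe_last_field() {
  awk '
  {
    delete found                       # forget names seen on previous lines
    num = split($NF, array, ";")       # break the last field into its entries
    val = ""
    for (i = 1; i <= num; i++)
      if (!found[array[i]]++)          # first time this name appears on the line
        val = (val == "" ? "" : val ";") array[i]
    $NF = val                          # rebuilds the line, joining fields with OFS
  }
  1                                    # print the (possibly edited) line
  ' "$@"
}

# Made-up two-line sample; the second line repeats SGIP1 from the first.
printf '%s\n' \
  'chr1 66999275 67216822 + SGIP1;SGIP1;SGIP1;SGIP1;MIR3117' \
  'chr1 67000000 67100000 - SGIP1;SGIP1' |
  dedupe_last_field
```

Note that assigning to `$NF` makes awk rebuild the line with the output field separator, which is a single space by default. If the real file is tab-separated, you would probably want to pass `-F'\t' -v OFS='\t'` to awk so the tabs are preserved.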
Upvotes: 1
Reputation: 3470
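I don't consider it "elegant", and it only works under certain assumptions: the strand column is always + (it is used as the field separator), + appears nowhere else in the line, and the gene list is the last column.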
awk -F"+" '{printf("%s+ ",$1); n=split($2,a,";"); for(i=1;i<=n;i++){gsub(" ", "", a[i]); if(!c[a[i]]++) printf("%s;", a[i])} printf("\n"); delete c}' test.txt
Tested on your input, returns:
chr1 66999275 67216822 + SGIP1;MIR3117;
Upvotes: 0