Reputation: 3022
I am trying to use awk to remove duplicate lines in a tab-delimited file when the $2 value is Fusion and the $4 value is the same on each line. In the example below, lines 1 and 2 have the same $2 value and their $4 value is also the same, so the duplicate line 2 is removed. Lines 3 and 4 follow the same logic. The number of lines may vary, but the format will be the same. Since lines 5 and 6 do not have Fusion in $2, they are skipped and printed in the output. Thank you :).
file
chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr15:88483984-chr12:12006495 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr15:88483984-chr12:12022903 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr10 SNV ....
chr15 SNV ....
awk
awk -F'\t' '{if($2 in a)a[$2]=$2=="Fusion"?$0:a[$4];else a[$4]=$0}END{for(i in a)print a[i]}' file
desired output
chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr10 SNV ....
chr15 SNV ....
Upvotes: 1
Views: 261
Reputation: 440576
awk -F'\t' '!($2 == "Fusion" && seen[$4]++)' file
$2 == "Fusion" && seen[$4]++ matches lines whose 2nd field is equal to Fusion and whose 4th field has been seen at least once before.
seen[$4]++ is a common Awk idiom that incrementally builds an associative array of field values by adding entries on demand and recording the occurrence count of each value. The post-increment (...++) ensures that on encountering a given value for the first time seen[$4]++ evaluates to (conceptual) false, whereas all subsequent occurrences imply true.
!(...) negates the logic, evaluating to (conceptual) true only if the 2nd field is not Fusion, or the 2nd field is Fusion and its 4th field value has not been seen on an earlier Fusion line.
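To see the counting idiom in isolation, here is a minimal sketch that prints the value seen[$1]++ evaluates to on each line. The input values (a, b, a, a) and the variable names n and out are made up for illustration.

```shell
# Print each input value next to what seen[$1]++ evaluates to on that line.
# The input values (a, b, a, a) are invented for illustration.
out=$(printf '%s\n' a b a a | awk '{ n = seen[$1]++; print $1, n }')
printf '%s\n' "$out"
```

The first time a value appears the expression yields 0 (false), and every later occurrence yields a non-zero count (true).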
The whole !(...) expression is a pattern in Awk terminology, and a pattern that has no associated action (a { ... } block) defaults to printing the input record at hand (action { print } is implied).
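As a rough check, the command can be run against a tab-delimited copy of the question's sample. The sketch below assumes mktemp is available and uses a throwaway temporary file. The "...." field is copied as-is from the question's elided data.

```shell
# Recreate the sample from the question as a tab-delimited file.
# The temporary file is an assumption for this sketch; "...." is the
# question's own placeholder and is kept verbatim.
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868 \
  chr15:88483984-chr12:12006495 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868 \
  chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833 \
  chr15:88483984-chr12:12022903 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833 > "$tmp"
printf '%s\t%s\t%s\n' chr10 SNV .... chr15 SNV .... >> "$tmp"

# Print a line unless it is a Fusion line whose $4 was already seen.
out=$(awk -F'\t' '!($2 == "Fusion" && seen[$4]++)' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Because lines are printed as they are read, the surviving lines keep their input order and the first line of each duplicate group is the one retained.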
Tip of the hat to Ed Morton for his help.
Upvotes: 3
Reputation: 1371
This seemed to work for me:
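awk -F'\t' '{if($2 == "Fusion"){if(!($4 in a))a[$4] = $0} else b[$0]=$0;}END{for(k in a)print a[k];for(l in b)print b[l];}' file
The only issues are that it reorders things so that all the $2 == "Fusion" cases come first, and that the order within each group is not guaranteed, since awk does not define the iteration order of for (k in a).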
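If buffering the lines until END is still wanted, one possible way around the reordering is to store surviving lines under an increasing counter and loop over it numerically. This is only a sketch: the array names idx and keep, the counter n, and the shortened sample rows (r1 to r4) are made up for illustration.

```shell
# Sample input piped in directly; rows r1..r4 are shortened stand-ins
# for the data in the question, not real records.
out=$(printf 'r1\tFusion\tx\tA\nr2\tFusion\tx\tA\nr3\tSNV\nr4\tFusion\tx\tB\n' |
  awk -F'\t' '
    # First Fusion line for each $4: remember the key. Later ones: skip.
    $2 == "Fusion" { if ($4 in idx) next; idx[$4] = 1 }
    # Store every surviving line under an increasing counter.
    { keep[++n] = $0 }
    # A numeric loop, unlike for (k in keep), visits lines in input order.
    END { for (i = 1; i <= n; i++) print keep[i] }
  ')
printf '%s\n' "$out"
```

This keeps rows r1, r3 and r4 in their original order, at the cost of holding all kept lines in memory until the end of input.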
Upvotes: 1