justaguy

Reputation: 3022

awk to remove duplicates in specific lines based on keyword in field

I am trying to use awk to remove duplicate lines in a tab-delimited file when the $2 value is Fusion and the $4 value is the same on both lines. In the example below, lines 1 and 2 both have Fusion in $2 and their $4 values match, so the duplicate line 2 is removed. Lines 3 and 4 follow the same logic. The number of lines may vary, but the format will be the same. Since lines 5 and 6 do not have Fusion in $2, they are skipped and printed unchanged in the output. Thank you :).

file

chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr15:88483984-chr12:12006495 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr15:88483984-chr12:12022903 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr10     SNV     ....
chr15     SNV     ....

awk

awk -F'\t' '{if($2 in a)a[$2]=$2=="Fusion"?$0:a[$4];else a[$4]=$0}END{for(i in a)print a[i]}' file

desired output

chr12:12006495-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E4N15 1868
chr12:12022903-chr15:88483984 Fusion Gain-of-Function ETV6NTRK3-E5N15 414833
chr10     SNV     ....
chr15     SNV     ....

Upvotes: 1

Views: 261

Answers (2)

mklement0

Reputation: 440576

awk -F'\t' '!($2 == "Fusion" && seen[$4]++)' file
  • $2 == "Fusion" && seen[$4]++ matches lines whose 2nd field is equal to Fusion and whose 4th field has been seen at least once before.

    • seen[$4]++ is a common Awk idiom that incrementally builds an associative array of field values by adding entries on demand and recording the occurrence count of each value. The post-increment (...++) returns the count as it was before the increment, so on encountering a given value for the first time seen[$4]++ evaluates to 0, which is (conceptual) false, whereas all subsequent occurrences yield a nonzero count and therefore true.
  • !(...) negates the logic, evaluating to (conceptual) true only if:

    • the 2nd field does not equal Fusion
    • or the 4th field value is being seen for the first time.
  • The whole !(...) expression is a pattern in Awk terminology, and a pattern that has no associated action (a { ... } block) defaults to printing the input record at hand
    (action { print } is implied).
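
To see the pieces described above in action, here is a minimal sketch that rebuilds the question's sample as a tab-delimited file and runs the one-liner on it. The file names sample.tsv and deduped.tsv are made up for illustration, and the rows are the ones quoted in the question.

```shell
# Post-increment returns the old count: 0 (false) the first time, 1 (true) the second.
awk 'BEGIN { print seen["x"]++, seen["x"]++ }'   # prints: 0 1

# Recreate the sample from the question with real tab separators.
{
  printf 'chr12:12006495-chr15:88483984\tFusion\tGain-of-Function\tETV6NTRK3-E4N15\t1868\n'
  printf 'chr15:88483984-chr12:12006495\tFusion\tGain-of-Function\tETV6NTRK3-E4N15\t1868\n'
  printf 'chr12:12022903-chr15:88483984\tFusion\tGain-of-Function\tETV6NTRK3-E5N15\t414833\n'
  printf 'chr15:88483984-chr12:12022903\tFusion\tGain-of-Function\tETV6NTRK3-E5N15\t414833\n'
  printf 'chr10\tSNV\t....\n'
  printf 'chr15\tSNV\t....\n'
} > sample.tsv

# Print every line except Fusion lines whose 4th field was already seen.
awk -F'\t' '!($2 == "Fusion" && seen[$4]++)' sample.tsv > deduped.tsv
cat deduped.tsv
```

Because && short-circuits, seen[$4]++ is never evaluated for the SNV lines, so they pass through untouched and never enter the array.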

Tip of the hat to Ed Morton for his help.

Upvotes: 3

Heman Gandhi

Reputation: 1371

This seemed to work for me:

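awk -F'\t' '{if($2 == "Fusion"){if(!($4 in a))a[$4] = $0} else b[$0]=$0}END{for(k in a)print a[k];for(l in b)print b[l]}' file

The only issue is that it does not preserve the input order: all the $2 == "Fusion" lines are printed first, and for (k in a) visits the entries in an unspecified order. Note that the comparison is case-sensitive, so the string must be written exactly as it appears in the file (Fusion), and that identical non-Fusion lines are collapsed into one because they share the key b[$0].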
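
The reordering comes from printing out of arrays in an END block, where for (k in a) gives no ordering guarantee. As a rough sketch of one way around that while keeping the buffer-then-print structure, the lines can be stored under a running index and replayed in that order. The array names a and line, the counter n, and the file names mixed.tsv and ordered.tsv are illustrative assumptions, not part of the answer above.

```shell
# Sample with Fusion and SNV rows interleaved (tab-delimited, rows taken from the question).
{
  printf 'chr10\tSNV\t....\n'
  printf 'chr12:12006495-chr15:88483984\tFusion\tGain-of-Function\tETV6NTRK3-E4N15\t1868\n'
  printf 'chr15\tSNV\t....\n'
  printf 'chr15:88483984-chr12:12006495\tFusion\tGain-of-Function\tETV6NTRK3-E4N15\t1868\n'
} > mixed.tsv

awk -F'\t' '
  # Drop a Fusion line if its 4th field was already recorded.
  $2 == "Fusion" && ($4 in a) { next }
  # Record the 4th field of the first Fusion line that carries it.
  $2 == "Fusion" { a[$4] = 1 }
  # Buffer every surviving line under a running index.
  { line[++n] = $0 }
  # Replay the buffer by index, which keeps the input order.
  END { for (i = 1; i <= n; i++) print line[i] }
' mixed.tsv > ordered.tsv
cat ordered.tsv
```

Unlike the b[$0] approach, this sketch does not collapse identical non-Fusion lines, and it holds the whole output in memory until the end of input.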

Upvotes: 1
