ALG

Reputation: 212

awk: print duplicate entries after a condition is met

I have a large file with different types of entries, separated by tabs:

## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON
2_1221_A/T/C    value5  CANON
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON
4_12_T/C    value8   SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON
4_167_A/G   value9  CANON

I want to print everything but change the $3 value to "CANON_DUPL" in those entries meeting these conditions:

  1. NOT starting with #.
  2. $3 value must be "CANON".
  3. $1 value must be duplicated among the "CANON" entries (the same $1 appears with "CANON" more than once).

So the final table must be:

## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON_DUPL
2_1221_A/T/C    value5  CANON_DUPL
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON
4_12_T/C    value8  SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON_DUPL
4_167_A/G   value9  CANON_DUPL

I tried with awk, but I only managed to meet the first two conditions:

> awk 'BEGIN {FS=OFS="\t"}; !/#/$3~"CANON"{$3="CANON_DUPL"} {print $0}' file.txt
## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON_DUPL #should not be modified
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON_DUPL #should not be modified
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON_DUPL
2_1221_A/T/C    value5  CANON_DUPL
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON_DUPL #should not be modified
4_12_T/C    value8  SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON_DUPL
4_167_A/G   value9  CANON_DUPL

I don't know whether a solution outside awk would be easier to implement.
Any thoughts?

Note: Edited to better reflect the file structure.

Upvotes: 1

Views: 98

Answers (2)

RavinderSingh13

Reputation: 133770

With your shown samples, please try the following. It reads Input_file twice and, on the first pass, keeps a count for every ($1,$3) pair in memory, which can add up for large datasets. If your actual Input_file is tab-separated, start the awk program with BEGIN{FS=OFS="\t"}.

awk '
(FNR==1 || FNR==2 || FNR==3 ){
  if(++count<=3){ print }
  next
}
FNR==NR{
  arr[$1,$3]++
  next
}
arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{
  $3="CANON_DUPL"
}
1
'  Input_file  Input_file

Explanation: a detailed, commented version of the code above.

awk '                                     ##Start of the awk program.
(FNR==1 || FNR==2 || FNR==3 ){            ##True for the first 3 lines (the ## headers) on each pass over Input_file.
  if(++count<=3){ print }                 ##count keeps growing across passes, so the headers are printed only on the 1st pass.
  next                                    ##Skip the remaining rules for this line.
}
FNR==NR{                                  ##TRUE only while Input_file is being read for the 1st time.
  arr[$1,$3]++                            ##Count how many times each ($1,$3) combination occurs.
  next                                    ##Skip the remaining rules for this line.
}
arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{  ##2nd pass: the ($1,$3) combination occurs more than once AND the line does not start with # AND the 3rd field is CANON.
  $3="CANON_DUPL"                         ##Set the 3rd field to CANON_DUPL.
}
1                                         ##Print the current line.
' Input_file  Input_file                  ##Input_file is passed twice so that it is read in two passes.
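
As a rough sketch of the same two-pass idea, the variant below skips every line starting with # instead of hard-coding three header lines, and sets tab as both input and output separator. The temporary file, its shortened contents and the array name seen are assumptions made for illustration, not part of the original answer.

```shell
# Illustrative input in a temporary file (contents modelled on the question's
# sample and shortened; this is an assumption, not the real data).
sample=$(mktemp)
printf '%b\n' \
  '## HEADER 1' \
  '#Col1\tCol2\tCol3' \
  '1_222_A/G\tvalue1\tISO' \
  '1_222_A/G\tvalue1\tCANON' \
  '2_1221_A/T/C\tvalue5\tISO' \
  '2_1221_A/T/C\tvalue5\tCANON' \
  '2_1221_A/T/C\tvalue5\tCANON' \
  > "$sample"

result=$(awk '
BEGIN { FS = OFS = "\t" }
# First pass: count each ($1,$3) pair on non-comment lines, print nothing.
FNR == NR { if ($0 !~ /^#/) seen[$1, $3]++; next }
# Second pass: relabel CANON rows whose ($1,$3) pair occurred more than once.
!/^#/ && $3 == "CANON" && seen[$1, $3] > 1 { $3 = "CANON_DUPL" }
# Print every line of the second pass, headers included.
1
' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this input, only the two CANON rows of 2_1221_A/T/C should come out as CANON_DUPL; the single CANON row of 1_222_A/G is left alone.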

Upvotes: 2

Thor

Reputation: 47239

Here is a one-pass solution. It assumes that entries with the same $1 are adjacent, as in the sample:

parse.awk

NR<=4 { print; next }
NR==5 { P1=$1; P2=$2; P3=$3; next }
$1 == P1 && $3 == "CANON" && P3 ~ /^CANON(_DUPL)?$/ { $3 = P3 = "CANON_DUPL" }
{ print P1, P2, P3; P1=$1; P2=$2; P3=$3 }
END { print P1, P2, P3 }

Run it like this:

awk -f parse.awk OFS='\t' infile
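
The position of a command-line assignment such as OFS='\t' matters: awk processes these assignments in argument order, so one that comes after the file operand takes effect only once that file has been read. A small sketch with a hypothetical two-line input (the variable names demo, before and after are illustrative):

```shell
# Hypothetical two-line input in a temporary file.
demo=$(mktemp)
printf 'a b\nc d\n' > "$demo"

# Assignment before the file operand: applied to every record of the file.
before=$(awk '{ print $1, $2 }' OFS='-' "$demo")

# Assignment after the file operand: processed only after the file has been
# read, so the per-record output still uses the default OFS (a single space).
after=$(awk '{ print $1, $2 }' "$demo" OFS='-')

printf '%s\n' "$before" "$after"
rm -f "$demo"
```

The first command should print a-b and c-d, the second a b and c d.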

Output:

## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON_DUPL
2_1221_A/T/C    value5  CANON_DUPL
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON
4_12_T/C    value8  SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON_DUPL
4_167_A/G   value9  CANON_DUPL

Upvotes: 0
