Reputation: 212
I have a large file with different types of entries, separated by tabs:
## HEADER 1
## HEADER 2
## HEADER 3
#Col1 Col2 Col3
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 CANON
1_506_C/T value2 ISO
1_506_C/T value2 CANON
1_245_A/T value3 SINGLE
2_1156_C/G value4 ISO
2_1156_C/G value4 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 CANON
2_1221_A/T/C value5 CANON
3_787_G/T value6 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 CANON
4_12_T/C value8 SINGLE
4_167_A/G value9 ISO
4_167_A/G value9 CANON
4_167_A/G value9 CANON
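I want to print everything, but change the $3 value to "CANON_DUPL" in those entries meeting all of these conditions: the line is not a header (it does not start with #), $3 is "CANON", and the same $1 has more than one "CANON" entry.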
So the final table must be:
## HEADER 1
## HEADER 2
## HEADER 3
#Col1 Col2 Col3
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 CANON
1_506_C/T value2 ISO
1_506_C/T value2 CANON
1_245_A/T value3 SINGLE
2_1156_C/G value4 ISO
2_1156_C/G value4 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 CANON_DUPL
2_1221_A/T/C value5 CANON_DUPL
3_787_G/T value6 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 CANON
4_12_T/C value8 SINGLE
4_167_A/G value9 ISO
4_167_A/G value9 CANON_DUPL
4_167_A/G value9 CANON_DUPL
I tried it with awk, but I only managed to meet the first two conditions:
> awk 'BEGIN {FS=OFS="\t"}; !/#/$3~"CANON"{$3="CANON_DUPL"} {print $0}' file.txt
## HEADER 1
## HEADER 2
## HEADER 3
#Col1 Col2 Col3
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 CANON_DUPL #should not be modified
1_506_C/T value2 ISO
1_506_C/T value2 CANON_DUPL #should not be modified
1_245_A/T value3 SINGLE
2_1156_C/G value4 ISO
2_1156_C/G value4 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 CANON_DUPL
2_1221_A/T/C value5 CANON_DUPL
3_787_G/T value6 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 CANON_DUPL #should not be modified
4_12_T/C value8 SINGLE
4_167_A/G value9 ISO
4_167_A/G value9 CANON_DUPL
4_167_A/G value9 CANON_DUPL
I don't know whether solutions outside awk would be easier to implement.
Any thoughts?
Note: Edited to better reflect the file structure.
Upvotes: 1
Views: 98
Reputation: 133770
With your shown samples, please try the following. It reads Input_file twice and keeps a counter for every ($1,$3) pair in memory, so memory use grows with the number of distinct pairs in large datasets. If your actual Input_file is tab-separated, change awk '
to awk 'BEGIN{FS=OFS="\t"}
in the following code.
awk '
(FNR==1 || FNR==2 || FNR==3 ){
if(++count<=3){ print }
next
}
FNR==NR{
arr[$1,$3]++
next
}
arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{
$3="CANON_DUPL"
}
1
' Input_file Input_file
Explanation: the same code with a detailed comment on each line.
awk ' ##Start of the awk program.
(FNR==1 || FNR==2 || FNR==3 ){ ##True for the first 3 lines of each read of Input_file.
if(++count<=3){ print } ##Print them only during the first read (count goes past 3 on the second read).
next ##Skip all further statements for this line.
}
FNR==NR{ ##True only while Input_file is being read for the first time.
arr[$1,$3]++ ##Count how many times each $1,$3 pair occurs.
next ##Skip all further statements for this line.
}
arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{ ##Second read: the $1,$3 pair occurs more than once AND the line does not start with # AND the 3rd field is CANON.
$3="CANON_DUPL" ##Set the 3rd field to CANON_DUPL.
}
1 ##Print the current line.
' Input_file Input_file ##Input_file is passed twice so that it is read twice.
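For tab-separated input, the same two-pass idea could also be sketched as below. This is only a sketch under assumptions: the function name canon_dupl_two_pass is made up for illustration, header lines are taken to be those starting with #, and it counts CANON rows per $1 instead of counting every ($1,$3) pair.

```shell
# canon_dupl_two_pass is a hypothetical helper; pass the file path as its only argument.
canon_dupl_two_pass() {
  awk '
    BEGIN { FS = OFS = "\t" }
    # First read: count CANON rows per $1, ignoring header lines.
    FNR == NR { if ($0 !~ /^#/ && $3 == "CANON") canon[$1]++; next }
    # Second read: relabel CANON rows whose $1 has more than one CANON row.
    $0 !~ /^#/ && $3 == "CANON" && canon[$1] > 1 { $3 = "CANON_DUPL" }
    # Print every line, modified or not.
    { print }
  ' "$1" "$1"
}
```

It would be called as canon_dupl_two_pass Input_file. Only one counter per $1 that has a CANON row is kept in memory.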
Upvotes: 2
Reputation: 47239
Here is a one-pass solution:
parse.awk
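# Print the four header lines unchanged.
NR<=4 { print; next }
# Remember the first data row; every row is printed one step late.
NR==5 { P1=$1; P2=$2; P3=$3; next }
# Two consecutive CANON rows with the same $1: relabel both.
$1 == P1 && $3 == "CANON" && P3 == "CANON" { $3 = P3 = "CANON_DUPL" }
# Print the previous row, then remember the current one.
{ print P1, P2, P3; P1=$1; P2=$2; P3=$3 }
# Print the last remembered row.
END { print P1, P2, P3 }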
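Run it like this (the OFS assignment must come before the file name, otherwise it only takes effect after the file has been read):
awk -f parse.awk OFS='\t' infile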
Output:
## HEADER 1
## HEADER 2
## HEADER 3
#Col1 Col2 Col3
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 ISO
1_222_A/G value1 CANON
1_506_C/T value2 ISO
1_506_C/T value2 CANON
1_245_A/T value3 SINGLE
2_1156_C/G value4 ISO
2_1156_C/G value4 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 ISO
2_1221_A/T/C value5 CANON_DUPL
2_1221_A/T/C value5 CANON_DUPL
3_787_G/T value6 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 ISO
3_99089_A/C value7 CANON
4_12_T/C value8 SINGLE
4_167_A/G value9 ISO
4_167_A/G value9 CANON_DUPL
4_167_A/G value9 CANON_DUPL
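Note that parse.awk compares each row only with the row directly before it, so in a run of three or more CANON rows for the same $1 the third one would stay CANON. A possible one-pass variant buffers each group of consecutive rows that share $1 and relabels them when the group is printed. It is a sketch that assumes exactly three tab-separated columns and that rows with the same $1 are adjacent, as in the sample; canon_dupl_one_pass and emit_group are hypothetical names.

```shell
# canon_dupl_one_pass is a hypothetical helper; pass the file path as its only argument.
canon_dupl_one_pass() {
  awk '
    BEGIN { FS = OFS = "\t" }
    # Print the buffered group; CANON rows are relabelled if the group had more than one.
    function emit_group(   i) {
      for (i = 1; i <= n; i++)
        print pre[i], (tag[i] == "CANON" && canon > 1 ? "CANON_DUPL" : tag[i])
      n = canon = 0
    }
    # Header lines are passed through unchanged.
    /^#/ { emit_group(); print; next }
    # A new $1 closes the previous group.
    $1 != key { emit_group(); key = $1 }
    # Buffer the current row and count CANON rows in the group.
    { pre[++n] = $1 OFS $2; tag[n] = $3; if ($3 == "CANON") canon++ }
    END { emit_group() }
  ' "$1"
}
```

Memory use is bounded by the largest group of rows sharing one $1, not by the size of the file.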
Upvotes: 0