Reputation: 31
I have a very large tab-separated file formatted like below (roughly 300 columns and 1.2 million rows). After column 4, every column has the same format: either #/# (with varying numbers) or ./.
The headings are a bit long, which gives the sample below a slightly shifted look.
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393472 G [A, C] 5/6 ./. ./. 1/0 0/0
1J_LG1 393475 G [A] ./. ./. ./. ./. ./.
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
1J_LG1 393762 A [C] 0/0 ./. ./. 2/2 ./.
I am trying to remove every row where the value ./.
appears in 60% or more of the columns after column 4 (not including column 4, so column 5 onward). In the example above, data rows 3 and 5 would be removed, producing the output below.
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393472 G [A, C] 5/6 ./. ./. 1/0 0/0
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
I am very new to programming and bioinformatics and would really appreciate the help. From doing some research, I think I should be using something like awk
or sed
, but I'm not sure.
Also I'm sorry if the formatting isn't great.
Upvotes: 3
Views: 62
Reputation: 204015
Since ./. and 0/0 only ever appear in the fields after #4, and always as a full field, all you need is:
$ awk -F'\t' 'gsub(/\.\/\./,"&") < ((NF-4)*.6)' file
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393472 G [A, C] 5/6 ./. ./. 1/0 0/0
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
$ awk -F'\t' 'gsub(/\.\/\.|0\/0/,"&") < ((NF-4)*.6)' file
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
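The trick here is that awk's gsub() returns the number of substitutions it performed, and replacing each match with & (the matched text itself) leaves the line unchanged, so the expression simply counts occurrences; a pattern with no action defaults to printing the lines where it is true. A quick sketch:

```shell
# gsub(/\.\/\./, "&") replaces each "./." with itself and returns how
# many replacements it made, i.e. the count of "./." in the line. The
# test line below has 7 fields, 2 of them "./.": since 2 is not < 3*.6,
# the full filter would drop this row.
printf 'a\tb\tc\td\t./.\t1/1\t./.\n' |
awk -F'\t' '{ n = gsub(/\.\/\./, "&"); print n, NF-4, (n < (NF-4)*.6) }'
# prints: 2 3 0
```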
Upvotes: 1
Reputation: 133640
1st solution: could you please try the following.
awk '{for(i=5;i<=NF;i++){val=$i;sum+=sub(/\.\/\./,"",val)};if((sum/(NF-4))*100<60){print};sum=""}' Input_file
Or, a non-one-liner form of the above solution:
awk '
{
  # count how many fields from column 5 onward contain ./.
  for(i=5;i<=NF;i++){
    val=$i
    sum+=sub(/\.\/\./,"",val)
  }
  # print the row only when fewer than 60% of those fields matched
  if((sum/(NF-4))*100<60){
    print
  }
  # reset the counter before the next line
  sum=""
}' Input_file
2nd solution: this one could be faster (though I haven't checked it on millions of lines).
awk '{for(i=5;i<=NF;i++){if($i ~ /^\.\/\.$/){sum++}};if((sum/(NF-4))*100<60){print};sum=""}' Input_file
Or, a non-one-liner form of the above solution:
awk '
{
  # count the fields from column 5 onward that are exactly ./.
  for(i=5;i<=NF;i++){
    if($i ~ /^\.\/\.$/){
      sum++
    }
  }
  # print the row only when fewer than 60% of those fields matched
  if((sum/(NF-4))*100<60){
    print
  }
  # reset the counter before the next line
  sum=""
}' Input_file
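If you want to verify the speed claim on data shaped like yours, a throwaway benchmark along these lines would do (synth.txt, the row and column counts, and the 50% ./. rate are all made-up stand-ins for your real file):

```shell
# Generate a synthetic file: 10000 rows, 4 leading columns plus 50
# genotype columns, each genotype ./. or 0/1 at random.
awk 'BEGIN{
  srand(42)
  for(r=1; r<=10000; r++){
    printf "chr1 %d A [C]", r
    for(c=1; c<=50; c++) printf " %s", (rand()<0.5 ? "./." : "0/1")
    print ""
  }
}' > synth.txt

# Time both one-liners and confirm they produce identical output.
time awk '{for(i=5;i<=NF;i++){val=$i;sum+=sub(/\.\/\./,"",val)};if((sum/(NF-4))*100<60){print};sum=""}' synth.txt > out1.txt
time awk '{for(i=5;i<=NF;i++){if($i ~ /^\.\/\.$/){sum++}};if((sum/(NF-4))*100<60){print};sum=""}' synth.txt > out2.txt
cmp -s out1.txt out2.txt && echo "outputs match"
```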
In case you have TAB-delimited data, change awk '
to awk 'BEGIN{FS=OFS="\t"}
in the above codes too.
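Putting that note together with the 2nd solution, a complete tab-separated run might look like this (sample.tsv and its two rows are a made-up extract of the data in the question):

```shell
# Two tab-separated rows: the first is 100% ./. (dropped),
# the second is 20% ./. (kept).
printf '1J_LG1\t393475\tG\t[A]\t./.\t./.\t./.\t./.\t./.\n'  >  sample.tsv
printf '1J_LG1\t393466\tA\t[C]\t./.\t1/2\t1/1\t0/0\t0/1\n' >> sample.tsv

# Same counting loop, with FS and OFS set to TAB as the note suggests.
awk 'BEGIN{FS=OFS="\t"}
{
  sum=0
  for(i=5;i<=NF;i++) if($i ~ /^\.\/\.$/) sum++
  if((sum/(NF-4))*100 < 60) print
}' sample.tsv
```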
Upvotes: 1