How do I remove an entire row if more than 60% or more of columns after a specific column contains the same specific value?

Question

I have a very large tab separated file formated like below (with roughly 300 columns and 1.2 million rows). The columns are all formated the same after column 4 (just with different numbers, but are all formated with #/# or ./.)

The headings seem to be to long resulting in a bit of shifted look

CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort        LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1     
1J_LG1  393472  G       [A, C]  5/6     ./.     ./.     1/0     0/0         
1J_LG1  393475  G       [A]     ./.     ./.     ./.     ./.     ./.     
1J_LG1  393482  T       [C]     0/0     1/1     1/2     2/2     ./.
1J_LG1  393762  A       [C]     0/0     ./.     ./.     2/2     ./.

I am trying to remove all rows with the values ./. in 60% or more of every column after column 4 (not including column 4, so column 5 and onward). Essentially in the example above rows 3 and 5 would be removed. It should produce an output as seen below.

CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort        LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1
1J_LG1  393472  G       [A, C]  5/6     ./.     ./.     1/0     0/0
1J_LG1  393482  T       [C]     0/0     ./.     ./.     2/2     ./.

I am very new to programming and bioinformatics and would really appreciate the help. From doing some research I think that I should be using something like awk or sed, but I'm not sure.

Also I'm sorry if the formatting isn't great.

Ed Morton · Accepted Answer

Since ./. and 0/0 only ever appear in the fields after #4 and are always a full field, all you need is:

$ awk -F'	' 'gsub(/\./\./,"&") < ((NF-4)*.6)' file
CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort     LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1
1J_LG1  393472  G       [A, C]  5/6     ./.     ./.     1/0     0/0
1J_LG1  393482  T       [C]     0/0     1/1     1/2     2/2     ./.

$ awk -F'	' 'gsub(/\./\.|0/0/,"&") < ((NF-4)*.6)' file
CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort     LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1
1J_LG1  393482  T       [C]     0/0     1/1     1/2     2/2     ./.

How do I remove an entire row if more than 60% or more of columns after a specific column contains the same specific value?

Answers (2)

Related Questions