D. Nedaie
D. Nedaie

Reputation: 31

How do I remove an entire row if more than 60% or more of columns after a specific column contains the same specific value?

I have a very large tab separated file formated like below (with roughly 300 columns and 1.2 million rows). The columns are all formated the same after column 4 (just with different numbers, but are all formated with #/# or ./.)

The headings seem to be to long resulting in a bit of shifted look

CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort        LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1     
1J_LG1  393472  G       [A, C]  5/6     ./.     ./.     1/0     0/0         
1J_LG1  393475  G       [A]     ./.     ./.     ./.     ./.     ./.     
1J_LG1  393482  T       [C]     0/0     1/1     1/2     2/2     ./.
1J_LG1  393762  A       [C]     0/0     ./.     ./.     2/2     ./.   

I am trying to remove all rows with the values ./. in 60% or more of every column after column 4 (not including column 4, so column 5 and onward). Essentially in the example above rows 3 and 5 would be removed. It should produce an output as seen below.

CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort        LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1
1J_LG1  393472  G       [A, C]  5/6     ./.     ./.     1/0     0/0
1J_LG1  393482  T       [C]     0/0     ./.     ./.     2/2     ./.

I am very new to programming and bioinformatics and would really appreciate the help. From doing some research I think that I should be using something like awk or sed, but I'm not sure.

Also I'm sorry if the formatting isn't great.

Upvotes: 3

Views: 62

Answers (2)

Ed Morton
Ed Morton

Reputation: 204015

Since ./. and 0/0 only ever appear in the fields after #4 and are always a full field, all you need is:

$ awk -F'\t' 'gsub(/\.\/\./,"&") < ((NF-4)*.6)' file
CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort     LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1
1J_LG1  393472  G       [A, C]  5/6     ./.     ./.     1/0     0/0
1J_LG1  393482  T       [C]     0/0     1/1     1/2     2/2     ./.

$ awk -F'\t' 'gsub(/\.\/\.|0\/0/,"&") < ((NF-4)*.6)' file
CHROM   POS     REF     ALT     LarsenP1_1_PI173627.sort        LarsenP1_1_PI173630.sort        LarsenP1_1_PI204384.sort        LarsenP1_1_PI204386.sort     LarsenP1_1_PI204550.sort
1J_LG1  393466  A       [C]     ./.     1/2     1/1     0/0     0/1
1J_LG1  393482  T       [C]     0/0     1/1     1/2     2/2     ./.

Upvotes: 1

RavinderSingh13
RavinderSingh13

Reputation: 133640

1st solution: Could you please try following.

awk '{for(i=5;i<=NF;i++){val=$i;sum+=sub(/\.\/\./,"",val)};if((sum/(NF-4))*100<60){print};sum=""}'   Input_file

OR(adding a non-one liner form of above solution)

awk '
{
  for(i=5;i<=NF;i++){
    val=$i
    sum+=sub(/\.\/\./,"",val)
  }
  if((sum/(NF-4))*100<60){
    print
  }
  sum=""
}'  Input_file

2nd Solution: Or this could be more fast(though haven't checked on millions of lines).

awk '{for(i=5;i<=NF;i++){if($i ~ /^\.\/\.$/){sum++}};if((sum/(NF-4))*100<60){print};sum=""}'  Input_file

OR(a non-one liner form of above solution):

awk '
{
  for(i=5;i<=NF;i++){
    if($i ~ /^\.\/\.$/){
      sum++
    }
  }
  if((sum/(NF-4))*100<60){
      print
  }
  sum=""
}'   Input_file

In case you have TAB delimited data then change awk to awk 'BEGIN{FS=OFS="\t"}.... in above codes too.

Upvotes: 1

Related Questions