Reputation: 518

Deleting lines with more than 30% lowercase letters

I try to process some data but I'am unable to find a working solution for my problem. I have a file which looks like:

>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg

and many lines more....

I want to filter out all the lines and the corresponding headers (header starts with >) where the sequence string (those not starting with >) are containing 30 or more percent lowercase letters. And the sequence strings can span multiple lines.

So after command xy the output should look like:

>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct

I tried some mix of a while loop for reading the input file and then working with awk, grep, sed but there was no good outcome.

Upvotes: 3

Answers (4)

ceving

Reputation: 23866

Nowadays I would not use sed or awk anymore for anything longer than 2 lines.

#! /usr/bin/perl
use strict;                                # Force variable declaration.
use warnings;                              # Warn about dangerous language use.

sub filter                                 # Declare a sub-routing, a function called `filter`.
{
  my ($header, $body) = @_;                # Give the first two function arguments the names header and body.
  my $lower = $body =~ tr/a-z//;           # Count the translation of the characters a-z to nothing.
  print $header, $body, "\n"               # Print header, body and newline,
    unless $lower / length ($body) > 0.3;  # unless lower characters have more than 30%.
}

my ($header, $body);                       # Declare two variables for header and body.
while (<>) {                               # Loop over all lines from stdin or a file given in the command line.
  if (/^>/) {                              # If the line starts with >,
    filter ($header, $body)                # call filter with header and body,
      if defined $header;                  # if header is defined, which is not the case at the beginning of the file.
    ($header, $body) = ($_, '');           # Assign the current line to header and an empty string to body.
  } else {
    chomp;                                 # Remove the newline at the end of the line.
    $body .= $_;                           # Append the line to body.
  }
}
filter ($header, $body);                   # Filter the last record.

Upvotes: 1

jas

Reputation: 10865

Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.

Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).

To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.

Finally we check the ratio and print or not (adding back the initial ">" when we do print).

BEGIN { RS = ">" }

NR > 1 {
    total_cnt = 0
    lower_cnt = 0
    for (i=2; i<=NF; ++i) {
        total_cnt += length($i)
        s = $i
        lower_cnt += gsub(/[a-z]/, "", s)
    }
    ratio = lower_cnt / total_cnt
    if (ratio < 0.3) print ">"$0
}


$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct

Upvotes: 4

grail

Reputation: 930

Or:

awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file

RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars

Upvotes: 2

NeronLeVelu

Reputation: 10039

awk '/^>/{b=B;gsub( /[A-]/,"",b);
          if( length( b) < length( B) * 0.3) print H "\n" B
          H=$0;B="";next}

     {B=( (B != "") ? B "\n" : "" ) $0}

     END{ b=B;gsub( /[A-]/,"",b);
          if( length( b) < length( B) * 0.3) print H "\n" B
        }' YourFile

quick qnd dirty, a function suite better the need for printing

Upvotes: 1

Deleting lines with more than 30% lowercase letters

Answers (4)

Related Questions