Reputation: 113

Dealing with the lone carriage return as end of line symbol

So I have a program that gets rid of extra line breaks in fasta files copied and pasted from the web. If you don't know what a fasta file should look like: it is a greater-than symbol followed by anything (this is usually title info), then a new line. The next line should contain your complete sequence (in biology, DNA or amino acids) on one line, and then the pattern repeats.

Anyway, the problem is I need the program to be flexible enough to deal with anything: \r, \n, or \r\n. The chomp statement with underscores on either side is the command that removes the excess lines in the sequence portion. How can I make that chomp get rid of all three of the options (\r, \n, \r\n)? Can I set $\ = @linefeeds and have @linefeeds = "\r", "\n", "\r\n";?

I have read up online, and I know that this topic has been covered before, but I just can't seem to get it to work.

Here is my code to do so in a file:

print "Please enter file name, using the full pathway, to save your cleaned fasta file to:\n";
chomp( $new_file = <STDIN> );
open( New_File, "+>$new_file" ) or die "Couldn't create file. Check permissions on location.\n";

#process the file line by line, chomping all lines that do not contain "greater than" and
#removing all white space from lines that do not contain "greater than"

my $firstline = 1;
while ( my $lines = <FASTA> ) {
    foreach ($lines) {
        if ( !/>/ ) {
            _chomp($lines);_
            $lines =~ s/ //g;
            print New_File "$lines";
        } else {
            if ( $firstline == 1 ) {
                print New_File "$lines";
                $firstline = 0;
            } else {
                print New_File "\n$lines";
                next;
            }
        }
    }
}

Upvotes: 3

Answers (3)

DVK

Reputation: 129423

There are three issues to address from your question:

1. The technical question of how to strip whitespace, INCLUDING assorted newlines, from a string.
2. The general question of how to process the file format described. I will present a different solution which works if the file is small enough that you can slurp the whole file into a string in memory.
3. Reading the file in chunks (e.g. line by line), to avoid slurping the whole file into memory.

To strip both the whitespace and assorted newlines from a non-title line (i.e. your _chomp_ line), you can do:
```
$lines =~ s/[\n\r]|\s//g; # \s already matches \r and \n; they are spelled out for clarity
```
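As a quick check of the substitution above, here is a minimal sketch that applies it to a sequence line ending in each of the three terminators. The helper name `strip_whitespace` and the sample sequence are made up for illustration; the sketch relies on `\s` matching `\r` and `\n`, so a single character class is enough.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every whitespace character, line breaks included.
sub strip_whitespace {
    my ($line) = @_;
    $line =~ s/\s+//g;    # \s covers space, tab, \r and \n
    return $line;
}

# Try the same sequence line with each of the three line endings.
for my $ending ("\r", "\n", "\r\n") {
    my $cleaned = strip_whitespace("ACG TTA$ending");
    print "$cleaned\n";    # prints ACGTTA each time
}
```

Because the line break is removed as ordinary whitespace, no separate chomp is needed on sequence lines.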

In addition, if your file is small enough that slurping it all into memory as a single long string is an option, you can (at the cost of slightly slower code) use shorter, hopefully more readable logic instead of the logic in your sample code:

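# Split into records at each line break that is followed by ">".
# The lookahead keeps the ">" at the start of every record.
my @records = split(/(?:\015\012|\015|\012)(?=>)/, $text);
foreach my $record (@records) {
    # Title is the first line; everything after it is sequence data.
    my ($title, $rest) = ($record =~ /^(>[^\n\r]*)(.*)$/s) or next;
    $rest =~ s/\s+//g; # Strip whitespace AND newlines.
    print New_File "$title\n$rest\n";
}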
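The snippet above assumes that `$text` already holds the whole file. One common way to get there is to undefine `$/` locally so that a single read returns everything. The sketch below is only an illustration: `slurp_file` is a hypothetical helper name, and the `:raw` layer is my assumption, chosen so that no platform layer rewrites `\r\n` before your own code sees it.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string ("slurp" mode).
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;             # undef $/ makes <$fh> return the entire file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

Usage would be along the lines of `my $text = slurp_file($fasta_path);`, where `$fasta_path` stands for whatever file name the user entered.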

However, if the data is large enough that you must read it in chunks (for text, a chunk is usually one line), there is a problem with BOTH your proposed code and the code I showed above.

Perl's standard line-by-line reading via the <> operator (or readline) uses the input record separator ($/) to decide where a line ends, and $/ is "\n" by default. If your file is entirely "\r"-separated, it will be treated as one giant line, meaning you will slurp the whole file whether you like it or not. Changing $/ to "\r" fixes that one case but breaks "\n"-separated files, so it does not help when the line ending is not known in advance.

Unfortunately, $/ (input record separator) must be a string and cannot be a regular expression.
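The effect is easy to see with an in-memory filehandle. This is a small sketch with made-up sample data; `count_records` is a hypothetical helper that simply counts how many records readline hands back under the default `$/`.

```perl
use strict;
use warnings;

# Hypothetical helper: count the records readline returns for the given data.
sub count_records {
    my ($data) = @_;
    open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    close $fh;
    return $count;
}

print count_records(">seq1\nACGT\nTTGA\n"), "\n";    # LF-separated: prints 3
print count_records(">seq1\rACGT\rTTGA\r"), "\n";    # CR-separated: prints 1
```

The CR-only data comes back as a single record, which is exactly the accidental slurp described above.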

Therefore, if you absolutely MUST read the file with arbitrary newlines in chunks due to size consideration, you need to read file in fixed block sizes instead of line by line, and then parse out individual lines from those blocks.

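To do such reading, you can set $/ to a reference to an integer (e.g. $/ = \4096) and then use readline() / <>, which will then return fixed-size records of at most that size instead of lines.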
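A minimal sketch of that block-based approach follows. It assumes `$/` is set to a reference to an integer, as perlvar documents for record reads. The helper name `each_line`, its callback interface, and the default block size are my own choices for illustration, not part of any module. The one subtle point is a `\r` at the very end of a block: it may be the first half of a `\r\n` pair, so it stays in the buffer until the next block arrives.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in fixed-size blocks and call $callback once
# per line, whichever of \r, \n or \r\n terminates it.
sub each_line {
    my ($fh, $callback, $block_size) = @_;
    $block_size ||= 4096;
    local $/ = \$block_size;    # reference to an integer: fixed-size records
    my $buffer = '';
    while (defined(my $block = <$fh>)) {
        $buffer .= $block;
        # Emit every line whose terminator is certainly complete. A trailing
        # "\r" stays buffered because its "\n" may be in the next block.
        while ($buffer =~ s/\A([^\r\n]*)(?:\r\n|\n|\r(?!\z))//) {
            $callback->($1);
        }
    }
    $buffer =~ s/\r\z//;        # a lone "\r" at end of file is a terminator
    $callback->($buffer) if length $buffer;
    return;
}

# Usage with made-up data and a tiny block size to exercise the boundary case.
my $data = ">seq1\rACG T\r\nTTGA\n>seq2\rCCGG";
open(my $in, '<', \$data) or die "Cannot open in-memory handle: $!";
my @lines;
each_line($in, sub { push @lines, $_[0] }, 4);
close $in;
print join('|', @lines), "\n";
```

Note that `$/` is still localized while the callback runs, so a `chomp` inside the callback would not remove anything; the lines it receives already have their terminators stripped.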

Please note that the module mentioned in cjm's answer (PerlIO::eol) takes exactly this approach, but it is implemented as an XS module and thus does it in C code (the PerlIOEOL_get_base() function has a 4k buffer size).

Upvotes: 2

cjm

Reputation: 62109

The fundamental problem is that $/ can only be set to a single string, and there's no value you can set it to that will match all of CR, LF, and CRLF line endings.

But, you aren't the first person with this problem. I haven't tried it myself, but if you install PerlIO::eol, you should be able to say:

binmode FASTA, ":raw:eol(LF)";

and it will automagically convert CR, LF, or CRLF line endings to LF for you.

Upvotes: 3

AdrianHHH

Reputation: 14047

I tend to use s/[\r\n]+$//;. When I also want to delete trailing white space I actually use s/[\s\r\n]+$//;.

From the Perl manuals, it would be sufficient to say s/\s+$//; as \s includes both \r and \n but I like the clarity of spelling it out.
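To illustrate, here is a small sketch comparing that substitution with a plain chomp. The helper name `rstrip` and the sample strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace, including any line ending.
sub rstrip {
    my ($line) = @_;
    $line =~ s/\s+$//;
    return $line;
}

# The same title line with each of the three line endings.
for my $ending ("\r", "\n", "\r\n") {
    my $line = rstrip(">seq1 sample title  $ending");
    print "[$line]\n";    # prints [>seq1 sample title] each time
}

# For contrast: chomp with the default $/ ("\n") leaves the "\r" behind.
my $with_chomp = "ACGT\r\n";
chomp $with_chomp;
print length($with_chomp), "\n";    # prints 5, because "ACGT\r" remains
```

Only trailing whitespace is removed, so the spaces inside the title survive, which is what you want for title lines.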

Upvotes: 2
