djordje
djordje

Reputation: 349

Renaming a file if it fulfills two conditions

I am reading many pdb files from current directory, in each of the file I am concentrating on specific lines and I am putting two conditions and if these conditions are fulfiled then I am trying to rename the file by adding certain elements from the line to its existing filename. My lines that I am operating with look like this:

HET     CA  A 800       1                                                       
HET    SO4  A 901       5                                                       
HET    SO4  A 902       5                                                       
HET    SO4  A 903       5                                                       
HET    RAM  A 509      11                                                       
HET    GTR  A 510      12                                                       
HET    RAM  A 511      15                                                       
HET    GTR  A 512      12                                                       
HET    RAM  A 513      15                                                       
HET    GTR  A 514      12                                                       
HETNAM      CA CALCIUM ION                                                      
HETNAM     SO4 SULFATE ION                                                      
HETNAM     RAM ALPHA-L-RHAMNOSE                                                 
HETNAM     GTR BETA-D-GALACTOPYRANURONIC ACID                                   
HETSYN     GTR GALACTURONIC ACID                                                
FORMUL   2   CA    CA 2+                                                        
FORMUL   3  SO4    3(O4 S 2-)                                                   
FORMUL   6  RAM    3(C6 H12 O5 N5)                                                 
FORMUL   6  GTR    3(C6 H10 O7)                                                 
FORMUL   7  HOH   *362(H2 O) 

So for example, from this input only RAM fulfills both conditions because it has 15 under HET and sum of C O N is =>6 in FORMUL, so I am expecting to grep RAM and to add it into filename that has been processed. So something like this : filename_RAM if there are more elements that fulfill both conditions then more elements will be added to the filename

I have written the script that looks like this:

#! usr/bin/env perl

use autodie;
use warnings;
use strict;
use File::Glob;

my $pdbs;
my $fh;
my @fh;
my @data;
my $c;
my @lines;
my $lines;
my $file_sz;
my $line;
my @colums;
my $colums;
my @het;
my $het;
my $cols;
my @cols;
my %letters;
my @hetnam;
my $hetnam;

foreach my $pdbs (glob '*pdb') #Reading each pdb file from the current directory
{
    printf "%s\n", $pdbs;
    open my $fh, "<" ,$pdbs;  #Read each pdb file into a filehandle
    #print "$fh\n";
    @lines= <$fh>;            #Putting content of each file into an array
    $file_sz = scalar @lines;
    #print "$file_sz\n";
    #print "@lines\n";

    for ($c=0; $c<=$#lines; $c++) #Reading each line
    {
        chomp ($lines[$c]);
        if ($lines[$c] =~ m/^HET /) #If line starts with HET 
        {   
            my @colums = split ' ', $lines[$c];
            # print join "\t", $colums [4];
            print "$colums[4]\n";       #Printing only a fifth column (4th element)
            push @hetnam, $colums[1];
        }

        if ($lines[$c] =~ m/^FORMUL /) #If line starts with FORMUL
        {   
            my @cols = split /\s+/, $lines[$c], 4;
            next unless $cols[0] eq 'FORMUL';   
            my %letters = $cols[-1] =~ m/([A-Z])(\d+)/g;
            $letters{$lines[$c]} = 0 for (qw[C O N]);
            next if $letters{C} <= 2
              and $letters{C} + $letters{O} + $letters{N} <= 6;
            #print "@cols\n";
        }

        if ($colums[4]=>6 && $letters{C} <= 2 && $letters{C} + $letters{O} + $letters{N} => 6) #line 61
        {
        system ("mv $pdbs $pdbs/_$hetnam");
        }
    }
}

The error that I am encountering is :

mv: accessing `4YE1.pdb/_': Not a directory
Use of uninitialized value $letters{"C"} in numeric le (<=) at script1 line 61, <$_[...]> line 5708.
Use of uninitialized value in addition (+) at script1 line 61, <$_[...]> line 5708.

Upvotes: 0

Views: 187

Answers (5)

tripleee
tripleee

Reputation: 189679

Your code didn't attempt to implement crucial parts of the logic you describe, and had various small logic bugs in other places. Here is an attempt at refactoring this into reasonably idiomatic Perl while keeping as much of your original as possible.

I have left out the actual renaming functionality, and left in some debugging prints to help you see how you might try to help yourself see what your program is doing while you are developing it.

#! usr/bin/env perl

use autodie;
use warnings;
use strict;

foreach my $pdb (<*pdb>)
{
    printf "# %s\n", $pdb;
    open my $fh, "<" ,$pdb;
    my %hets;
    my @suf;
    # Don't slurp the entire file. Just read a line at a time.
    for my $line (<$fh>)
    {
        chomp ($line);
        if ($line =~ m/^HET /)
        {   
            my @columns = split ' ', $line;
            ####print "$columns[4] < 6? ", ($columns[4] < 6 ? "yes" : "no"), "\n";
            next if $columns[4] < 6;
            # We have a HIT, er, HET -- remember it
            $hets{$columns[1]} = $columns[4];
            ####print "$columns[1] added to hets\n";
        }
        elsif ($line =~ m/^FORMUL /)
        {
            my @cols = split /\s+/, $line, 4;
            # If this is not in $hets, skip it
            ####print "$cols[2] in %hets? ", ($hets{$cols[2]} ? "yes" : "no"), "\n";
            next unless $hets{$cols[2]};
            ####print "\$cols[-1] is $cols[-1]\n";
            # Initialize these to zero _before_ extracting actual counts
            my %letters = (C=>0, O=>0, N=>0);
            while ($cols[-1] =~ m/([CON])(\d+)/g)
            {
                $letters{$1} = $2;
                ####print "\$letters{$1} = '$2'\n"
            }
            my $con = $letters{"C"} + $letters{"O"} + $letters{"N"};
            # Bug fix: next condition was wrong
            # ("next if this OR that" is equivalent to "don't next if this AND that")
            if ($letters{"C"}>2 && $con >= 6)
            {
                push @suf, $cols[2]
            }
        }
    }
    if (@suf)
    {
        print "rename $pdb, ", join("_", $pdb, @suf), "\n";
    }
    # my %hets and my @suf go out of scope here --
    # helps you find bugs and make sure you don't process old results
    # from a previous file
}

Declaring all your variabes as effectively global with my completely nullifies the reason to use my in the first place. The variable should be declared at the scope where you actually use it, and then forgotten when it goes out of scope. This is hard for beginners, but helps you down the line because you cannot stumble over your own feet by using the same variable for different things in different places, or operate on an old value from a previous iteration of the same loop (the value will be undef from the my declaration until you actually give it a value). And of course, declaring globals with my and then redeclaring them with my again in an enclosed scope is not only completely pointless, but also surely utterly confusing.

Apart from actually implementing many of the conditions you had in your requirements but not in your code, the major change here is collecting the matching suffixes while processing the lines in a file, then only attempting to rename it once we have processed the entire file.

Update by @Borodin

I have altered the code, hoping to explain my ideas in the comments. The semantics are identical: this is no refactoring, and I have no idea whether the logic is correct.

use strict;
use warnings;
use autodie;

for my $pdb ( glob '*pdb' ) {

    printf "# %s\n", $pdb;

    open my $fh, "<", $pdb;

    my %hets;
    my @suf;

    for my $line ( <$fh> ) {

        chomp( $line );

        if ( $line =~ m/^HET / ) {
            my @columns = split ' ', $line;
            next if $columns[4] < 6;

            $hets{ $columns[1] } = $columns[4];
        }
        elsif ( $line =~ m/^FORMUL / ) {

            my @cols = split /\s+/, $line, 4;

            next unless $hets{ $cols[2] };

            my %letters = ( C => 0, O => 0, N => 0 );

            while ( $cols[-1] =~ m/([CON])(\d+)/g ) {
                $letters{$1} = $2;
            }

            my $con = $letters{"C"} + $letters{"O"} + $letters{"N"};

            if ( $letters{"C"} > 2 && $con >= 6 ) {
                push @suf, $cols[2];
            }
        }
    }

    if ( @suf ) {
        print "rename $pdb, ", join( "_", $pdb, @suf ), "\n";
    }
}

Upvotes: 1

tripleee
tripleee

Reputation: 189679

Here's a quick stab at reimplementing what you are describing in shell + Awk.

#!/bin/bash

for file in *pdb; do
    if suffixes=$(awk 'BEGIN { suf = "" }
        # If first column is HET and number >= 6, remember this one
        $1 == "HET" && $5 >= 6 { het[$2] = $5 }
        # If first column is FORMUL and this is a HET we remembered ...
        $1 == "FORMUL" && ($3 in het) {
           # If there are parentheses, trim them and anything outside
           sub(/^.*\(/, "", $4);
           sub(/\).*/, "", $NF);
           # Now sum O, C, and N entries, but abort if C <= 2 or missing
           sum = 0
           c = 0
           for(i=4; i<=NF; ++i) {
             if ($i ~ /^[OCN]/) {
               n = 0 + substr($i, 2)
               if ($i ~ /^C/} {
                 if (n <= 2) next;
                 c=1;
               }
               sum += n
             }
           }
           # If we did not see any C, abort this line
           if (!c) next;
           # Now if sum >= 6, add suffix
           if (sum >= 6) suf = suf "_" $3 
        }
        # We are done. Print result if any.
        # Otherwise "exit 1" will cause the calling "if" to fail.
        END { if (suf) { print suf; exit 0 } exit 1 }' "$file")
    then
        mv "$file" "$file$suffixes"
    fi
done

This produces the suffix _RAM_GTR for your test data and I don't see why GTR should not be included if your criteria are correctly articulated.

I will reiterate from comments that one of the errors you had is the slash which you incorrectly added before the underscore in the destination file name. There is no way you can rename a file to a directory name, which is what the slash designates in Unix file names.

Upvotes: 0

Borodin
Borodin

Reputation: 126742

I can't be sure of a fix, because you haven't described in detail what your code should do. But the code block for FORMUL doesn't make sense, and I think it should be like this

if ( $lines[$c] =~ m/^FORMUL\b/ ) {
    my @cols = split ' ', $lines[$c], 4;
    my %letters = ( C => 0, O => 0, N => 0 );

    $letters{$1} = $2 while $cols[-1] =~ m/([A-Z])(\d+)/g;

    next if $letters{C} <= 2
        and $letters{C} + $letters{O} + $letters{N} <= 6;
}

Also note that your move operation

system ("mv $pdbs $pdbs/_$hetnam")

Is asking the system to create a directory with the same name as a file that already exists, which will fail. You can do this in three steps using tempdir from File::Temp:

use File:;Temp 'tempdir';

my $newdir = tempdir;
rename $pdbs, "$newdir/_$hetnam";
rename $newdir, $pdbs;

But note that you never set a value for $hetnam

Depending on the underlying platform, you may need to replace rename with move from File::Copy

Upvotes: 0

dlamblin
dlamblin

Reputation: 45351

It could be shell escaping in the system call, or a bug in how you've setup %letters.

You might have better success using the built in rename or the core module File::Copy. You could also change your system call to have 3 arguments, one for the process mv and one each for the source and destination. But I think you should just use the first two options.

Wait though, I see you're trying to move from $pdbs to $pdbs/_$hetname. This is going to be tricky, because the file exists, and then you try to move it to a path with its own name in a place were a directory should be? That's going to require multiple steps, like:

rename($pdbs, "$pdbs_$hetname");
mkdir($pdbs);
rename("$pdbs_$hetname", "$pdbs/_$hetname");

except I hope each $pdbs is unique, because otherwise this is going to fail the next time around.

Upvotes: 0

Zaid
Zaid

Reputation: 37146

  • mv: accessing ``4YE1.pdb/_': Not a directory

    The system call is trying to move (and rename, which it never gets round to) the file 4YE1.pdb under a folder of the same name, which it cannot find . If you intend to create a folder with the same name as the file (which I would highly discourage), you need to create it first with mkdir (or an equivalent).

  • Use of uninitialized value $letters{"C"} in numeric le (<=) at script1 line 61, <$_[...]> line 5708

    Changes/assignments made to %letters only exists in the scope of the if block as it was localized with my. To make it available to both if blocks, remove the my.

    So instead of:

    my %letters;
    if ( ... ) { my %letters = ... }
    if ( ... ) { next if $letters{C} ... }
    

    Write it as:

    my %letters;
    if ( ... ) { %letters = ... }
    if ( ... ) { next if $letters{C} ... }
    

Upvotes: 2

Related Questions