Reputation: 349
I am reading many pdb
files from current directory, in each of the file I am concentrating on specific lines and I am putting two conditions and if these conditions are fulfiled then I am trying to rename the file by adding certain elements from the line to its existing filename. My lines that I am operating with look like this:
HET CA A 800 1
HET SO4 A 901 5
HET SO4 A 902 5
HET SO4 A 903 5
HET RAM A 509 11
HET GTR A 510 12
HET RAM A 511 15
HET GTR A 512 12
HET RAM A 513 15
HET GTR A 514 12
HETNAM CA CALCIUM ION
HETNAM SO4 SULFATE ION
HETNAM RAM ALPHA-L-RHAMNOSE
HETNAM GTR BETA-D-GALACTOPYRANURONIC ACID
HETSYN GTR GALACTURONIC ACID
FORMUL 2 CA CA 2+
FORMUL 3 SO4 3(O4 S 2-)
FORMUL 6 RAM 3(C6 H12 O5 N5)
FORMUL 6 GTR 3(C6 H10 O7)
FORMUL 7 HOH *362(H2 O)
HET
and FORMUL
.HET
lines I am reading last column or 4th element. From FORMUL
lines I am reading last column.HET
lines is >= 6
and if under FORMUL
there is C >= 2
&&
sum of O C N
is >= 6
then I want to grep the 3rd element under FORMUL
line and to put it into filename.So for example, from this input only RAM
fulfills both conditions because it has 15
under HET
and sum of C O N
is =>6
in FORMUL
, so I am expecting to grep RAM
and to add it into filename that has been processed. So something like this : filename_RAM
if there are more elements that fulfill both conditions then more elements will be added to the filename
I have written the script that looks like this:
#! usr/bin/env perl
use autodie;
use warnings;
use strict;
use File::Glob;
my $pdbs;
my $fh;
my @fh;
my @data;
my $c;
my @lines;
my $lines;
my $file_sz;
my $line;
my @colums;
my $colums;
my @het;
my $het;
my $cols;
my @cols;
my %letters;
my @hetnam;
my $hetnam;
foreach my $pdbs (glob '*pdb') #Reading each pdb file from the current directory
{
printf "%s\n", $pdbs;
open my $fh, "<" ,$pdbs; #Read each pdb file into a filehandle
#print "$fh\n";
@lines= <$fh>; #Putting content of each file into an array
$file_sz = scalar @lines;
#print "$file_sz\n";
#print "@lines\n";
for ($c=0; $c<=$#lines; $c++) #Reading each line
{
chomp ($lines[$c]);
if ($lines[$c] =~ m/^HET /) #If line starts with HET
{
my @colums = split ' ', $lines[$c];
# print join "\t", $colums [4];
print "$colums[4]\n"; #Printing only a fifth column (4th element)
push @hetnam, $colums[1];
}
if ($lines[$c] =~ m/^FORMUL /) #If line starts with FORMUL
{
my @cols = split /\s+/, $lines[$c], 4;
next unless $cols[0] eq 'FORMUL';
my %letters = $cols[-1] =~ m/([A-Z])(\d+)/g;
$letters{$lines[$c]} = 0 for (qw[C O N]);
next if $letters{C} <= 2
and $letters{C} + $letters{O} + $letters{N} <= 6;
#print "@cols\n";
}
if ($colums[4]=>6 && $letters{C} <= 2 && $letters{C} + $letters{O} + $letters{N} => 6) #line 61
{
system ("mv $pdbs $pdbs/_$hetnam");
}
}
}
The error that I am encountering is :
mv: accessing `4YE1.pdb/_': Not a directory
Use of uninitialized value $letters{"C"} in numeric le (<=) at script1 line 61, <$_[...]> line 5708.
Use of uninitialized value in addition (+) at script1 line 61, <$_[...]> line 5708.
Upvotes: 0
Views: 187
Reputation: 189679
Your code didn't attempt to implement crucial parts of the logic you describe, and had various small logic bugs in other places. Here is an attempt at refactoring this into reasonably idiomatic Perl while keeping as much of your original as possible.
I have left out the actual renaming functionality, and left in some debugging prints to help you see how you might try to help yourself see what your program is doing while you are developing it.
#! usr/bin/env perl
use autodie;
use warnings;
use strict;
foreach my $pdb (<*pdb>)
{
printf "# %s\n", $pdb;
open my $fh, "<" ,$pdb;
my %hets;
my @suf;
# Don't slurp the entire file. Just read a line at a time.
for my $line (<$fh>)
{
chomp ($line);
if ($line =~ m/^HET /)
{
my @columns = split ' ', $line;
####print "$columns[4] < 6? ", ($columns[4] < 6 ? "yes" : "no"), "\n";
next if $columns[4] < 6;
# We have a HIT, er, HET -- remember it
$hets{$columns[1]} = $columns[4];
####print "$columns[1] added to hets\n";
}
elsif ($line =~ m/^FORMUL /)
{
my @cols = split /\s+/, $line, 4;
# If this is not in $hets, skip it
####print "$cols[2] in %hets? ", ($hets{$cols[2]} ? "yes" : "no"), "\n";
next unless $hets{$cols[2]};
####print "\$cols[-1] is $cols[-1]\n";
# Initialize these to zero _before_ extracting actual counts
my %letters = (C=>0, O=>0, N=>0);
while ($cols[-1] =~ m/([CON])(\d+)/g)
{
$letters{$1} = $2;
####print "\$letters{$1} = '$2'\n"
}
my $con = $letters{"C"} + $letters{"O"} + $letters{"N"};
# Bug fix: next condition was wrong
# ("next if this OR that" is equivalent to "don't next if this AND that")
if ($letters{"C"}>2 && $con >= 6)
{
push @suf, $cols[2]
}
}
}
if (@suf)
{
print "rename $pdb, ", join("_", $pdb, @suf), "\n";
}
# my %hets and my @suf go out of scope here --
# helps you find bugs and make sure you don't process old results
# from a previous file
}
Declaring all your variabes as effectively global with my
completely nullifies the reason to use my
in the first place. The variable should be declared at the scope where you actually use it, and then forgotten when it goes out of scope. This is hard for beginners, but helps you down the line because you cannot stumble over your own feet by using the same variable for different things in different places, or operate on an old value from a previous iteration of the same loop (the value will be undef
from the my
declaration until you actually give it a value). And of course, declaring globals with my
and then redeclaring them with my
again in an enclosed scope is not only completely pointless, but also surely utterly confusing.
Apart from actually implementing many of the conditions you had in your requirements but not in your code, the major change here is collecting the matching suffixes while processing the lines in a file, then only attempting to rename it once we have processed the entire file.
I have altered the code, hoping to explain my ideas in the comments. The semantics are identical: this is no refactoring, and I have no idea whether the logic is correct.
use strict;
use warnings;
use autodie;
for my $pdb ( glob '*pdb' ) {
printf "# %s\n", $pdb;
open my $fh, "<", $pdb;
my %hets;
my @suf;
for my $line ( <$fh> ) {
chomp( $line );
if ( $line =~ m/^HET / ) {
my @columns = split ' ', $line;
next if $columns[4] < 6;
$hets{ $columns[1] } = $columns[4];
}
elsif ( $line =~ m/^FORMUL / ) {
my @cols = split /\s+/, $line, 4;
next unless $hets{ $cols[2] };
my %letters = ( C => 0, O => 0, N => 0 );
while ( $cols[-1] =~ m/([CON])(\d+)/g ) {
$letters{$1} = $2;
}
my $con = $letters{"C"} + $letters{"O"} + $letters{"N"};
if ( $letters{"C"} > 2 && $con >= 6 ) {
push @suf, $cols[2];
}
}
}
if ( @suf ) {
print "rename $pdb, ", join( "_", $pdb, @suf ), "\n";
}
}
Upvotes: 1
Reputation: 189679
Here's a quick stab at reimplementing what you are describing in shell + Awk.
#!/bin/bash
for file in *pdb; do
if suffixes=$(awk 'BEGIN { suf = "" }
# If first column is HET and number >= 6, remember this one
$1 == "HET" && $5 >= 6 { het[$2] = $5 }
# If first column is FORMUL and this is a HET we remembered ...
$1 == "FORMUL" && ($3 in het) {
# If there are parentheses, trim them and anything outside
sub(/^.*\(/, "", $4);
sub(/\).*/, "", $NF);
# Now sum O, C, and N entries, but abort if C <= 2 or missing
sum = 0
c = 0
for(i=4; i<=NF; ++i) {
if ($i ~ /^[OCN]/) {
n = 0 + substr($i, 2)
if ($i ~ /^C/} {
if (n <= 2) next;
c=1;
}
sum += n
}
}
# If we did not see any C, abort this line
if (!c) next;
# Now if sum >= 6, add suffix
if (sum >= 6) suf = suf "_" $3
}
# We are done. Print result if any.
# Otherwise "exit 1" will cause the calling "if" to fail.
END { if (suf) { print suf; exit 0 } exit 1 }' "$file")
then
mv "$file" "$file$suffixes"
fi
done
This produces the suffix _RAM_GTR
for your test data and I don't see why GTR
should not be included if your criteria are correctly articulated.
I will reiterate from comments that one of the errors you had is the slash which you incorrectly added before the underscore in the destination file name. There is no way you can rename a file to a directory name, which is what the slash designates in Unix file names.
Upvotes: 0
Reputation: 126742
I can't be sure of a fix, because you haven't described in detail what your code should do. But the code block for FORMUL
doesn't make sense, and I think it should be like this
if ( $lines[$c] =~ m/^FORMUL\b/ ) {
my @cols = split ' ', $lines[$c], 4;
my %letters = ( C => 0, O => 0, N => 0 );
$letters{$1} = $2 while $cols[-1] =~ m/([A-Z])(\d+)/g;
next if $letters{C} <= 2
and $letters{C} + $letters{O} + $letters{N} <= 6;
}
Also note that your move operation
system ("mv $pdbs $pdbs/_$hetnam")
Is asking the system to create a directory with the same name as a file that already exists, which will fail. You can do this in three steps using tempdir
from File::Temp
:
use File:;Temp 'tempdir';
my $newdir = tempdir;
rename $pdbs, "$newdir/_$hetnam";
rename $newdir, $pdbs;
But note that you never set a value for $hetnam
Depending on the underlying platform, you may need to replace rename
with move
from File::Copy
Upvotes: 0
Reputation: 45351
It could be shell escaping in the system call, or a bug in how you've setup %letters
.
You might have better success using the built in rename or the core module File::Copy. You could also change your system call to have 3 arguments, one for the process mv
and one each for the source and destination. But I think you should just use the first two options.
Wait though, I see you're trying to move from $pdbs
to $pdbs/_$hetname
. This is going to be tricky, because the file exists, and then you try to move it to a path with its own name in a place were a directory should be? That's going to require multiple steps, like:
rename($pdbs, "$pdbs_$hetname");
mkdir($pdbs);
rename("$pdbs_$hetname", "$pdbs/_$hetname");
except I hope each $pdbs
is unique, because otherwise this is going to fail the next time around.
Upvotes: 0
Reputation: 37146
mv: accessing ``4YE1.pdb/_': Not a directory
The system
call is trying to move (and rename, which it never gets round to) the file 4YE1.pdb
under a folder of the same name, which it cannot find . If you intend to create a folder with the same name as the file (which I would highly discourage), you need to create it first with mkdir
(or an equivalent).
Use of uninitialized value $letters{"C"} in numeric le (<=) at script1 line 61, <$_[...]> line 5708
Changes/assignments made to %letters
only exists in the scope of the if
block as it was localized with my
. To make it available to both if
blocks, remove the my
.
So instead of:
my %letters;
if ( ... ) { my %letters = ... }
if ( ... ) { next if $letters{C} ... }
Write it as:
my %letters;
if ( ... ) { %letters = ... }
if ( ... ) { next if $letters{C} ... }
Upvotes: 2