Reputation: 21
I want to read a file to extract a few lines of information. I have created a do .. until loop to skip the file's lines until I reach the part I'm actually interested in, which contains the word V2000. I prefer to use a general regex rather than look for the literal V2000.
The match is found, but it doesn't break out of the do .. until loop, so I'm unable to extract the info that comes right after it.
Does anyone know why?
do {$line = <IN_SDF>;} until ($line =~ m/V\d+/);
and the rest of the code is:
my @aline = split ('', $line);
my $natoms = $aline[0];
my $out= shift;
do {
    <IN_SDF>;
    @aline = split ('', $_);
    print OUT_3D $aline[3]."\t".$aline[0]."\t".$aline[1]."\t".$aline[2]."\n";
} until --$natoms == 0;
Upvotes: 2
Views: 9394
Reputation: 1202
I ran across this while trying to parse a broken, single-line 50MB XML file. I wrote my own sub to do this, although I don't know whether it works for the original poster:
sub ReadNext($$) {
    my ($hh, $pattern) = @_;
    my ($buffer, $chunk, $chunkSize) = ('', '', 512);
    # Read fixed-size chunks until the accumulated buffer matches the pattern
    while ((my $bytesRead = read($hh, $chunk, $chunkSize)) > 0) {
        $buffer .= $chunk;
        if ($buffer =~ $pattern) {
            # Start and end offsets of the whole match within $buffer
            my ($matchStart, $matchEnd) = ($-[0], $+[0]);
            my $result = substr($buffer, $matchStart, $matchEnd - $matchStart);
            my $pos = tell($hh);
            # Rewind the stream to where this match left off
            seek($hh, $pos - (length($buffer) - $matchEnd), 0);
            return $result;
        }
    }
    return undef;
}
open(my $fh, '<', $ARGV[0]) or die("Could not open file: $!");
while (my $chunk = ReadNext($fh, qr/<RECORD>.+?<\/RECORD>/)) {
    print $chunk, "\n";
}
close($fh);
For me, this prints every RECORD element from the XML, each followed by a newline.
Upvotes: 0
Reputation: 118665
Are you assuming that a bare <IN_SDF> will load the next line from that filehandle into $_? That is incorrect. You only get that behavior with a while expression:
while (<IN_SDF>) is equivalent to while (defined($_=<IN_SDF>))
If you mean $_ = <IN_SDF> then say so.
For the first part of your question, this idiom:
while ($line = <IN_SDF>) {
    last if $line =~ m/V\d+/;
}
is preferable to
do {
    $line = <IN_SDF>
} until $line =~ m/V\d+/;
because the latter expression will go into an infinite loop when you run out of input (and $line becomes undefined).
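Putting both points together, here is a minimal sketch of how the whole read could look. It is not a drop-in replacement for your script: the sample data, the lexical filehandle $in and the @rows array are made up for illustration, and it assumes the fields are whitespace-separated, so it uses split ' ' in place of your split ''.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real SDF file.
my $data = join "\n",
    'header line',
    'another line',
    '  2  1  0  0  0  0  0  0  0  0999 V2000',
    '    0.0000    1.0000    2.0000 C',
    '    3.0000    4.0000    5.0000 O',
    'M  END',
    '';

# An in-memory filehandle keeps the example self-contained.
open(my $in, '<', \$data) or die "Could not open in-memory file: $!";

# Skip ahead to the counts line; the loop also ends cleanly at end of input.
my $line;
while ($line = <$in>) {
    last if $line =~ m/V\d+/;
}
die "No counts line found\n" unless defined $line;

# Assumption: the atom count is the first whitespace-separated field.
my ($natoms) = split ' ', $line;

my @rows;
while ($natoms-- > 0) {
    my $atom = <$in>;            # explicit assignment instead of a bare <$in>
    last unless defined $atom;   # stop if the file ends early
    my @f = split ' ', $atom;
    push @rows, join("\t", @f[3, 0, 1, 2]);
}
print "$_\n" for @rows;
```

If your counts line uses fixed-width columns rather than whitespace-separated fields, the split on the counts line would need to be swapped for substr or unpack.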
Upvotes: 14
Reputation: 29854
Let me get this straight. You are looking for 'V' followed by any number anywhere in the line. You then take the first character of that line as $natoms, which is a single digit telling you how many lines to scan. Is that correct?
As for your problem breaking out of the loop: when I ran a version of that code, it worked fine for me, with strict or without.
Upvotes: 0