Reputation: 21
I am trying to write a script to parse a trial balance sheet. The layout of each line in the file is always the same but I am having an issue getting my regex to match properly. The first 10 characters of the line are always the account number. Here is an example:
0000000099 S000 Doe, John 00 1,243.22 01/01/1901
I am trying to capture each of these columns in a separate variable, but my expressions aren't working.
Here is what I have so far.
#!/usr/bin/perl -w
use strict;
my $filename = "S:\\TELLERS\\GalaxyDown\\tbal";
my $answer = undef;
open(FIN, $filename) || die "File not found";
do {
    print "Enter an account number: ";
    chomp(my $acctNum = <STDIN>);
    if ($acctNum =~ /\d{1,10}/) {
        $acctNum = pad_zeros($acctNum);
        #print "$acctNum\n"; #test to make sure the padding extends the account
        #number to 10 digits - comment out after verification
        while (<FIN>) {
            #print "$_\n";
            if (m/(^[0-9]{10})/) {
                print "Passed\n";
            }
            else {
                print "Failed\n";
            }
        }
    }
    else {
        print "Invalid account number. Please try again.\n";
    }
    print "Would you like to view another account balance? (yes/no): ";
    chomp($answer = lc <STDIN>);
} while ($answer ne "no");
sub pad_zeros {
    my $optimal_length = 10;
    my $num = shift;
    $num =~ s/^(\d+)$/("0"x($optimal_length-length$1)).$1/e;
    return $num;
}
Any help would be appreciated.
Upvotes: 2
Views: 115
Reputation: 126762
There is nothing blatantly wrong with your code. You don't say what you mean by "not working", but I notice that you are reading through the file multiple times to search for the input. Once the end of the file has been reached you need to either seek to the start again or reopen the file.
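As a minimal sketch of the seek option, the snippet below rewinds a handle before each search. The in-memory data, the second account line, and the helper name find_account are made up for illustration; the real script would use its own handle on the tbal file.

```perl
use strict;
use warnings;

# Demo data held in memory so the sketch runs anywhere; the second
# account line is invented for illustration.
my $data = "0000000099 S000 Doe, John 00 1,243.22 01/01/1901\n"
         . "0000000100 S000 Roe, Jane 00 10.00 02/02/1902\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# Hypothetical helper: scan the handle from the top for a line that
# starts with the given 10-digit account number.
sub find_account {
    my ($handle, $acct_num) = @_;
    seek($handle, 0, 0) or die "Cannot rewind: $!";    # back to the start
    while (my $line = <$handle>) {
        return $line if index($line, $acct_num) == 0;
    }
    return;    # not found
}

my $first  = find_account($fh, '0000000100');
my $second = find_account($fh, '0000000099');    # found only because of the seek
print $first, $second;
```

Without the seek call the second search would start at end of file and find nothing, which is the behaviour described above.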
Here are some suggestions:
Don't use the -w command-line switch; use warnings is far better.
Use single quotes to delimit strings containing backslashes. Then they don't need escaping unless there is more than one of them together or they appear at the end of the string.
You would make a lot of seasoned Perl programmers a lot happier if you used snake_case instead of CamelCase for your local identifiers.
It is current best practice to use lexical file handles and the three-parameter form of open. You should also put $! into your die string so you can see why the open failed.
You check for /\d{1,10}/ in your input, which tests whether the string contains a run of digits anywhere. You probably meant /^\d{1,10}$/.
sub pad_zeroes is better written as sprintf '%0*d', $optimal_length, $_[0].
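To illustrate the last two suggestions together, here is a small sketch that validates the input with an anchored pattern and then pads it with sprintf. The helper name normalize_account and the sample inputs are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: accept 1 to 10 digits and nothing else,
# then left-pad with zeros to the full account-number width.
sub normalize_account {
    my ($input) = @_;
    return unless $input =~ /^\d{1,10}$/;
    return sprintf '%0*d', 10, $input;
}

for my $input ('99', '1234567890', 'abc123', '12345678901') {
    my $acct = normalize_account($input);
    printf "%-12s => %s\n", $input, defined $acct ? $acct : 'rejected';
}
```

The unanchored /\d{1,10}/ would have accepted 'abc123' and the 11-digit input as well, because each contains a run of digits somewhere.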
Here is a suggested rewrite. I have altered the code to check whether the account specified by the input text has been read, which is presumably your intention.
Note that a sequential search through the file for every new account number entered is vastly inefficient and only feasible for a small data file or a one-off program. I recommend you use Tie::File together with a hash that indicates which element of the tied array to read to access a given account number.
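One way the Tie::File idea could look is sketched below. The temporary file, its two sample lines, and the helper name lookup are assumptions so that the example runs on its own; the real script would tie the tbal file instead.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write a small made-up data file so the sketch is runnable as-is.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "0000000099 S000 Doe, John 00 1,243.22 01/01/1901\n",
                "0000000100 S000 Roe, Jane 00 10.00 02/02/1902\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Tie the file to an array: element N is line N, without the newline.
tie my @lines, 'Tie::File', $tmp_name or die "Cannot tie $tmp_name: $!";

# One pass over the file to map each account number to its array index.
my %index_of;
for my $i (0 .. $#lines) {
    $index_of{ substr $lines[$i], 0, 10 } = $i;
}

# Hypothetical helper: hash lookup, then a direct read of that one line.
sub lookup {
    my ($acct_num) = @_;
    my $i = $index_of{$acct_num};
    return defined $i ? $lines[$i] : undef;
}

my $hit = lookup('0000000100');
print "$hit\n" if defined $hit;
```

The file is scanned once to build the index; each later query reads only the line it needs.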
Note: It appears that your file uses fixed-width fields, i.e. the fields always start and end at the same character positions in the lines. If so, then rather than use regular expressions to process the data you should use substr or unpack. Even better, the module Parse::FixedLength allows you simply to specify the length of each field and will do the rest of the work for you.
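A rough sketch of the unpack approach follows. The column widths in the template are assumed purely for illustration, since the real widths of the tbal file are not given here, and the sample line is built to match that assumed layout.

```perl
use strict;
use warnings;

# ASSUMED column widths, for illustration only; check them against the real file.
# account(10) series(4) name(20) type(2) balance(12) date(10), one space between.
my $layout = 'A10 x A4 x A20 x A2 x A12 x A10';

# Build a sample line that follows the assumed layout.
my $line = sprintf '%-10s %-4s %-20s %-2s %12s %-10s',
    '0000000099', 'S000', 'Doe, John', '00', '1,243.22', '01/01/1901';

# "A" extracts a space-padded field and strips its trailing spaces;
# "x" skips the single separator byte between fields.
my ($account, $series, $name, $type, $balance, $date) = unpack $layout, $line;
$balance =~ s/^\s+//;    # right-aligned field: remove the leading padding

print join('|', $account, $series, $name, $type, $balance, $date), "\n";
```

Because the fields are cut by position, a name containing spaces or commas needs no special handling.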
#!/usr/bin/perl
use strict;
use warnings;
my $filename = 'S:\TELLERS\GalaxyDown\tbal';
my $answer;
do {
    print "Enter an account number: ";
    chomp(my $acct_num = <STDIN>);
    if ($acct_num =~ /^\d{1,10}$/) {
        $acct_num = pad_zeroes($acct_num);
        #print "$acct_num\n"; #test to make sure the padding extends the account
        #number to 10 digits - comment out after verification
        open(my $fin, '<', $filename) || die "File not found: $!";
        while (<$fin>) {
            if (/^$acct_num/) {
                print "Passed\n";
            }
        }
    }
    else {
        print "Invalid account number. Please try again.\n";
    }
    print "Would you like to view another account balance? (yes/no): ";
    chomp($answer = lc <STDIN>);
} until $answer eq 'no';
sub pad_zeroes {
    my $optimal_length = 10;
    sprintf '%0*d', $optimal_length, $_[0];
}
Upvotes: 0
Reputation: 107090
I'm not getting any points for this. Amon has pretty much nailed it, and given you everything you need to know including some wonderful suggestions.
You say your account line looks like this:
0000000099 S000 Doe, John 00 1,243.22 01/01/1901
The problem is that spaces can be used as part of a name. Mary Jane Von Corona has three spaces in it. However, it's a first name, Mary Jane, and a last name, Von Corona. How do I know where the name is split?
The best way is to either use a fixed length field, or use a separator that isn't in the file.
0000000099|S000|Doe|John|00|1,243.22|01/01/1901
Here, I'm using | as the field separator. I could do this:
my ( $account, $something, $last, $first,
     $something2, $balance, $date ) = split /\|/, $line;
This splits the entire line in one shot on the |.
If fields had a fixed width, I could use the substr function to pull apart the various fields in this line:
my $account = substr( $line, 0, 10 ); #First 10 characters is always the account number
I would also recommend using autodie. This way, you don't have to test various things like whether or not your file was successfully opened. Perl will automatically die (and usually with a nice error message) when things like this happen.
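A small sketch of what autodie changes is shown below. The path is assumed not to exist, so the open is expected to fail, and the eval is there only so the example can show the message instead of terminating.

```perl
use strict;
use warnings;
use autodie;    # open, close, and similar built-ins now throw on failure

# A path that is assumed not to exist, to trigger the failure.
my $missing = 'no/such/dir/tbal';

my $ok = eval {
    open(my $fh, '<', $missing);    # no "or die" needed
    close $fh;
    1;
};
my $error = $ok ? '' : "$@";        # autodie's exception stringifies to a message
print "open failed: $error" unless $ok;
```

In a normal script you would leave out the eval and let the program stop with that message.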
Upvotes: 1
Reputation: 8350
If you want to check the full line you can use something like this:
while (<FIN>) {
    if ( my @a = (m/^\s*(\d{1,10})\s+(S\d+)\s+(\w+)\s*,\s*(\w+)\s+(\d\d)\s+(\S+)\s+(\d\d?\/\d\d?\/(?:\d\d)\d\d)\s*/) ) {
        $a[0] = sprintf "%010d", $a[0];    # zero-pad the account number
        print "Account number: $a[0]\n";
        print "Account series: $a[1]\n";
        print "Account owner: $a[3] $a[2]\n";
        print "Account type: $a[4]\n";
        print "Account balance: $a[5]\n";
        print "Account date: $a[6]\n";
    } else {
        print "Failed\n";
    }
}
Any deviation from the required format will print "Failed". You can make adjustments according to your needs.
Upvotes: -1
Reputation: 57650
Your pad_zeros function is really a longhand form for sprintf '%0*d', $optimal_length, $num.
Your while(<FIN>) loop reads all lines in the tbal file and prints, for each line in that file, whether that line starts with a ten-digit number, but only for the first account number entered. (The readline operator <> is effectively an iterator, and is exhausted after you read all lines.) The solution is to open the filehandle inside the if branch.
There are a few other things that could be improved:
There is no need to initialize variables with undef: this is already their default value.
To open a filehandle, you should (1) use a normal variable for that file handle, and (2) use the three-argument form of open:
open my $fin, "<", $filename or die "Can't open $filename: $!";
where $! contains the reason why the open failed. Specifying an explicit mode < makes a few corner cases more secure.
Forward slashes also work as path separators on Windows, so the path can be written as S:/TELLERS/... without escaping.
To split a line into multiple fields, you have to think about the exact format: Is each field separated by a common separator, e.g. a space? In that case,
my @fields = split " ", $line;
would do the trick. Change the " " to a regex matching the separator if the fields are delimited differently (tabs, commas, etc.).
However, your format doesn't look that simple, because the comma after the surname likely isn't part of the data of the surname field (?).
A regex like
my $regex = qr{\A
    \s* ([0-9]{10})
    \s+ (S[0-9]{3})
    \s+ ([^,]+),                # the surname
    \s+ ([^0-9]+(?<!\s))        # other names
    \s+ ([0-9]{2})
    \s+ ([0-9,]+\.[0-9]{2})
    \s+ ([0-9]{2})
      / ([0-9]{2})
      / ([0-9]{4})
    \s*\z
}x;
my @fields = $line =~ $regex;
might be better, but that depends on the exact format you have.
Matching names is difficult, as some people may have more than one name. Consider the entries Gogh, Vincent van or Tucker, Charles III. I decided to match “any non-numeric string that doesn't end with a space character”.
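To see how that rule behaves on such names, here is a short sketch that runs the pattern from this answer against two made-up lines. The account numbers, balances, and the helper name parse_line are invented for the example.

```perl
use strict;
use warnings;

# Same pattern as in the answer, repeated so the sketch is self-contained.
my $regex = qr{\A
    \s* ([0-9]{10})
    \s+ (S[0-9]{3})
    \s+ ([^,]+),                # the surname
    \s+ ([^0-9]+(?<!\s))        # other names
    \s+ ([0-9]{2})
    \s+ ([0-9,]+\.[0-9]{2})
    \s+ ([0-9]{2})
      / ([0-9]{2})
      / ([0-9]{4})
    \s*\z
}x;

# Hypothetical helper: returns the nine captured fields, or an empty list.
sub parse_line {
    my ($line) = @_;
    return $line =~ $regex;
}

# Made-up sample lines with multi-word given names.
my @samples = (
    '0000000001 S000 Gogh, Vincent van 00 1,243.22 01/01/1901',
    '0000000002 S000 Tucker, Charles III 00 10.00 02/02/1902',
);

for my $line (@samples) {
    my @fields = parse_line($line) or next;
    print "surname: $fields[2], other names: $fields[3]\n";
}
```

The negative lookbehind makes the names field give back its trailing space, so the two-digit field that follows still matches.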
Upvotes: 1