Reputation: 57
i want to parse xls file using perl. The xls file has followed structure which makes parsing a bit tricky;
col1 col2
row1 School 1
row2 Dean John
row3 No.stu. 55
row4 some irrelevant stuff
row5 School2 2
row6 Dean Tony
row7 No. stu. 60
row8 some irrelevant stuff
the output i would like to achieve is:
col1 col2 col3
row1 School Dean No.stu.
row2 1 John 55
row3 2 Tony 60
The module i have been studying so far is Spreadsheet::ParseExcel
. Any other module would possibly help me out of here? Regards,
Thanks to @amon reply which provides a potential way to partially solve my problem. But as a perl beginner, i have a lot difficulties to digest the code.
the parsing part start being over my head from ROW:
, what is that used for? And i do not really know
my ($key, $val) = map {$worksheet->get_cell($row, $_)} $col_min .. $col_max;
can i interpret it as that given in the Spreadsheet::ParseExcel
documentation:
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );}
Also, before jumping into the output part, can i have a look at what have been parsed? Say, is there anyway to print out the variable which have been accumulated into the table %data
? I have been struggled for a while.
really appreciated all your help!
Upvotes: 0
Views: 1819
Reputation: 57600
You can use Spreadsheet::ParseExcel to read your file. Iterate over all rows, and store the first two fields in a hash. On every fourth line, you can write your data to the output, and clear the hash:
# Adapted from the module documentation
use strict; use warnings;
use Spreadsheet::ParseExcel;
my ($infile, $outfile) = @ARGV;
my $parser = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse($infile);
die $parser->error unless defined $workbook;
# select the first worksheet
my ($worksheet) = $workbook->worksheets();
# get bounds:
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
# assert that there are at least two fields per row:
$row_max - $row_min >= 1 or die "To few cells per row";
my %data; # accumulate data here
ROW:
for my $row ($row_min .. $row_max) {
# discard every fourth row:
if ($row - $row_min && ($row - $row_min) % 3 == 0) {
...; # write to output
%data = (); # clear cache
next ROW;
}
my ($key, $val) = map {$worksheet->get_cell($row, $_)} $col_min .. $col_max;
$data{$key} = $val;
}
For writing a spreadsheet, you can use Spreadsheet::WriteExcel. This would look like
# from the module documentation
my $out_workbook = Spreadsheet::WriteExcel->new($outfile);
my $out_worksheet = $out_workbook->add_worksheet;
...;
# write data inside our loop:
my @cols = qw/School Dean No.stu/;
for my $i (0 .. $#cols) {
my $val = delete $data{$cols[$i]} // die "uninitialized value for $cols[$i]";
$out_worksheet->write($row, $i, $val);
}
# do some error handling
if (my @keys = keys %data) {
die "Unexpected field(s) [@keys] encountered";
}
This requires perl5 v10 or later for the defined-or operator //
.
I am sorry that I used some constructs without explaining them properly.
I could keep a counter starting from one. Every time it hits 4
, I skip this row, and reset it. However, I already have a row counter, which I use instead. I do not know that the first row will be 0
, because $row_min
could be anything. So I transpose the row number $row - $row_min
to get the actual row count. It starts with zero.
Every fourth row, this actual count is divisible by three:
0 1 2 3 4 5 6 · · ·
* *
So I can use the modulus operator %
. However, 0 % $n == 0
is true for all $n
(zero is evenly divisible through all numbers), so I have to special case zero. I do this by checking that our count is not zero before I perform the divisibility test. All numbers except zero are true, so I can just test for the truthhness of our number. This leads to the test
if ($row - $row_min && ($row - $row_min) % 3 == 0) { ... }
map
expressionsThe map
function takes either of the following:
map EXPRESSION, LIST
map { BLOCK } LIST
-- note the lack of comma between the block and the list.It is very much like a nifty foreach-loop: For each value in the list, $_
is set to that value inside our expression. The expression then returns a value which is remembered. Once all items in the list are processed, map
returns a list of the values of the expression.
As an example, here is a map
expression that squares all numbers in the list:
my @squares = map { $_ * $_ } 1 .. 10; # 1, 4, 9 16, .. 100
I use the map
to fetch all cell values inside a row: I specify a list of all columns ($col_min .. $col_max
), and the map
block fetches the cell in that column for the current row.
So map
returns a list of cells, which I assign to the “Lvalue” list ($key, $val)
. The list assignment causes $key
to have the value of the first, and $val
the value of the second cell.
Written with a plain foreach
loop, this would have looked like:
my @cells;
for my $col ($col_min .. $col_max) {
push @cells, $worksheet->get_cell($row, $_);
}
my $key = shift @cells;
my $val = shift @cells;
The default method to dump data structures for debugging is to use the Data::Dumper module. If you want to look at hashes or arrays, make sure to pass the data structure as a reference. E.g.:
use Data::Dumper; # at the top of your script
warn Dumper \%data; # where ever you need the info
If you need better formatting, you can always write your own:
printf "Contents of %%data for row %d:\n", $row - $row_min;
for my $key (sort keys %data) {
printf "%10s:%s\n", $key, $data{$key}
}
This usage of the sort
function will sort its arguments in alphabetically ascending order.
Upvotes: 2