user2198367
user2198367

Reputation: 57

parsing an xls file using perl

i want to parse xls file using perl. The xls file has followed structure which makes parsing a bit tricky;

      col1      col2
row1  School    1
row2  Dean      John
row3  No.stu.   55
row4  some irrelevant stuff 
row5  School2   2
row6  Dean      Tony 
row7  No. stu.  60 
row8  some irrelevant stuff

the output i would like to achieve is:

      col1 col2 col3
row1 School Dean No.stu. 
row2 1      John  55
row3 2      Tony  60 

The module i have been studying so far is Spreadsheet::ParseExcel. Any other module would possibly help me out of here? Regards,

Thanks to @amon reply which provides a potential way to partially solve my problem. But as a perl beginner, i have a lot difficulties to digest the code.

the parsing part start being over my head from ROW:, what is that used for? And i do not really know

my ($key, $val) = map {$worksheet->get_cell($row, $_)} $col_min .. $col_max;

can i interpret it as that given in the Spreadsheet::ParseExcel documentation:

for my $row ( $row_min .. $row_max ) {

for my $col ( $col_min .. $col_max ) {

my $cell = $worksheet->get_cell( $row, $col );}

Also, before jumping into the output part, can i have a look at what have been parsed? Say, is there anyway to print out the variable which have been accumulated into the table %data? I have been struggled for a while.

really appreciated all your help!

Upvotes: 0

Views: 1819

Answers (1)

amon
amon

Reputation: 57600

You can use Spreadsheet::ParseExcel to read your file. Iterate over all rows, and store the first two fields in a hash. On every fourth line, you can write your data to the output, and clear the hash:

# Adapted from the module documentation
use strict; use warnings;
use Spreadsheet::ParseExcel;

my ($infile, $outfile) = @ARGV;

my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse($infile);

die $parser->error unless defined $workbook;

# select the first worksheet
my ($worksheet) = $workbook->worksheets();

# get bounds:
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();

# assert that there are at least two fields per row:
$row_max - $row_min >= 1 or die "To few cells per row";

my %data; # accumulate data here

ROW:
for my $row ($row_min .. $row_max) {
  # discard every fourth row:
  if ($row - $row_min && ($row - $row_min) % 3 == 0) {
    ...; # write to output
    %data = (); # clear cache
    next ROW;
  }
  my ($key, $val) = map {$worksheet->get_cell($row, $_)} $col_min .. $col_max;
  $data{$key} = $val;
}

For writing a spreadsheet, you can use Spreadsheet::WriteExcel. This would look like

# from the module documentation
my $out_workbook  = Spreadsheet::WriteExcel->new($outfile);
my $out_worksheet = $out_workbook->add_worksheet;
...;
# write data inside our loop:
my @cols = qw/School Dean No.stu/;
for my $i (0 .. $#cols) {
  my $val = delete $data{$cols[$i]} // die "uninitialized value for $cols[$i]";
  $out_worksheet->write($row, $i, $val);
}
# do some error handling
if (my @keys = keys %data) {
  die "Unexpected field(s) [@keys] encountered";
}

This requires perl5 v10 or later for the defined-or operator //.


Update:

I am sorry that I used some constructs without explaining them properly.

Discarding every forth row

I could keep a counter starting from one. Every time it hits 4, I skip this row, and reset it. However, I already have a row counter, which I use instead. I do not know that the first row will be 0, because $row_min could be anything. So I transpose the row number $row - $row_min to get the actual row count. It starts with zero.

Every fourth row, this actual count is divisible by three:

0 1 2 3 4 5 6 · · ·
      *     *

So I can use the modulus operator %. However, 0 % $n == 0 is true for all $n (zero is evenly divisible through all numbers), so I have to special case zero. I do this by checking that our count is not zero before I perform the divisibility test. All numbers except zero are true, so I can just test for the truthhness of our number. This leads to the test

if ($row - $row_min && ($row - $row_min) % 3 == 0) { ... }

map expressions

The map function takes either of the following:

  • map EXPRESSION, LIST
  • map { BLOCK } LIST -- note the lack of comma between the block and the list.

It is very much like a nifty foreach-loop: For each value in the list, $_ is set to that value inside our expression. The expression then returns a value which is remembered. Once all items in the list are processed, map returns a list of the values of the expression.

As an example, here is a map expression that squares all numbers in the list:

my @squares = map { $_ * $_ } 1 .. 10; # 1, 4, 9 16, .. 100

I use the map to fetch all cell values inside a row: I specify a list of all columns ($col_min .. $col_max), and the map block fetches the cell in that column for the current row.

So map returns a list of cells, which I assign to the “Lvalue” list ($key, $val). The list assignment causes $key to have the value of the first, and $val the value of the second cell.

Written with a plain foreach loop, this would have looked like:

my @cells;
for my $col ($col_min .. $col_max) {
  push @cells, $worksheet->get_cell($row, $_);
}
my $key = shift @cells;
my $val = shift @cells;

Looking at your data structures

The default method to dump data structures for debugging is to use the Data::Dumper module. If you want to look at hashes or arrays, make sure to pass the data structure as a reference. E.g.:

use Data::Dumper;   # at the top of your script
warn Dumper \%data; # where ever you need the info

If you need better formatting, you can always write your own:

printf "Contents of %%data for row %d:\n", $row - $row_min;
for my $key (sort keys %data) {
  printf "%10s:%s\n", $key, $data{$key}
}

This usage of the sort function will sort its arguments in alphabetically ascending order.

Upvotes: 2

Related Questions