Reputation: 845
I have a huge xlsx file (aboutn 127 MB) and want to read using Spreadsheet::Excel
module, but i am getting "Out of Memory" errors on 2GB RAM machine. (Note the script works fine with smaller excel 2007 files)
Is there any way to read the excel file line by line without hitting the memory limit.? searching google i came across http://discuss.joelonsoftware.com/default.asp?joel.3.160328.14 but i am not familar on how to store the spreadsheet into a scalar. Can someone gimme an example of reading excel 2007 files as scalar and printing cell values. Below is the current script i am running on smaller spreadsheets.
#!/usr/bin/perl
use Excel::Writer::XLSX;
use Spreadsheet::XLSX;
my $workbook = Excel::Writer::XLSX->new('Book1.xlsx');
my $worksheet = $workbook->add_worksheet();
# use strict;
my $excel = Spreadsheet::XLSX -> new ('Book2.xlsx');
my $date_format = $workbook->add_format();
$date_format->set_num_format('dd/mm/yy hh:mm');
# Columns of interest
@columns=(0,1,2,5,9,10,12,13,31);
@reportlist=("string1","String2","String3");
@actuallist=("ModifiedString1","ModifiedString2","ModifiedString3");
$max_list=$#reportlist;
foreach my $sheet (@{$excel -> {Worksheet}}) {
printf("Sheet: %s\n", $sheet->{Name});
$sheet -> {MaxRow} ||= $sheet -> {MinRow};
foreach my $row ($sheet -> {MinRow} .. $sheet -> {MaxRow}) {
$sheet -> {MaxCol} ||= $sheet -> {MinCol};
for ($c=0;$c<=$#columns;$c++){
$col=$columns[$c];
my $cell = $sheet -> {Cells} [$row] [$col];
if($col==0){
$cell->{Val}=~ s/\ GMT\+11\:00//g;
$worksheet->write($row,$c,$cell->{Val},$date_format);
}
if ($cell) {
$worksheet->write($row,$c,$cell -> {Val});
for($z=0;$z<=$#reportisplist;$z++){
if(($cell->{Val})=~ m/$reportlist[$z]/i){
$worksheet->write($row,$c,$actuallist[$z]);
}
}
}
}
}
}
$workbook->close();
Upvotes: 4
Views: 7165
Reputation: 326
The csv solution is a good one. But also consider saving as xlsb - it will often provide similar file-size shrinking while allowing some excel capabilities. (Would have posted this as a comment but haven't the reputation ... yet).
Upvotes: 0
Reputation: 41664
I'm working on a new module for fast and memory efficient reading of Excel xlsx files with Perl. It isn't on CPAN yet (it needs a good bit more work) but you can get it on GitHub.
Here is a example of how to use it:
use strict;
use warnings;
use Excel::Reader::XLSX;
my $reader = Excel::Reader::XLSX->new();
my $workbook = $reader->read_file( 'Book1.xlsx' );
if ( !defined $workbook ) {
die $reader->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my $sheetname = $worksheet->name();
print "Sheet = $sheetname\n";
while ( my $row = $worksheet->next_row() ) {
while ( my $cell = $row->next_cell() ) {
my $row = $cell->row();
my $col = $cell->col();
my $value = $cell->value();
print " Cell ($row, $col) = $value\n";
}
}
}
__END__
Update: This module never made it to CPAN quality. Try Spreadsheet::ParseXLSX instead.
Upvotes: 5
Reputation: 1305
Try this one. Assuming you have installed Spreadsheet::Read perl module which can determine the actual parser module to use for reading a file, below code fragments read and prints the cell of 1st worksheet of the input workbook. You can examine $workbook object to see all the options available to configure. This module can be used to read files in other formats like "csv", "xls" as well. Here is the link to tutorial which I found to be useful: http://search.cpan.org/~hmbrand/Spreadsheet-Read/Read.pm
ReadData can be configured by passing options. It has many options out of each 2 options which are "cells" and "rc" can be used to modify behaviour related to file reading.By default both options are set to true. If the "cells" is true then ReadData stores cells of the workbook in a hash in returned object. If the "rc" is true then ReadData stores cells of the workbook in an array in the returned object. In the below code fragment, by setting cells => 0 , the contents of the sheet will not be stored in a hash format in returned $workbook object thus saving some space in memory. By default this option is true i.e. 1 and so . Also to further prevent from reading complete file, you can set option "rc" to false as well.
use Spreadsheet::Read;
############################################################################
# function input : file in xlsx format with absolute path
# function output : prints 1st worksheet content if exist
############################################################################
sub print_xlsx_file{
my $file_path = shift;
my $workbook = ReadData($file_path,cells => 0 );
if(defined $workbook->[0]{'error'}){
print "Error occurred while processing $file_path:".
$workbook->[0]{'error'}."\n";
exit(-1);
}
my $worksheet = $workbook->[1];
my $max_rows = $worksheet->{'maxrow'};
my $max_cols = $worksheet->{'maxcol'};
for my $row_num (1..($max_rows))
{
for my $col_num (1..($max_cols)){
print $worksheet->{'cell'}[$col_num][$row_num]."\n";
}
}
}
# call above function
# print_xlsx_file("/home/chammu/mybook.xlsx");
Upvotes: 0
Reputation: 1998
Have you tried converting the XLSX into csv and reading it as a plain text file?
Upvotes: 4