Reputation: 675
I am converting a set of ~200 .xlsx files, each with multiple worksheets, into .csv. in2csv
allows extracting a single sheet using the -s
switch, but first I need to obtain the names of all worksheets in an .xlsx spreadsheet from the command line.
I've tried python's xlrd
package but it takes minutes for a single file >100MB, because it requires loading the entire workbook in order to read the worksheet wrappers.
I'm aware of gnumeric ssconvert
(the -S
switch does the trick) but would rather not install a 600MB package for just one function.
Is there a way to extract the worksheet names without loading the entire spreadsheet/workbook? If not, how can I separately convert each worksheet and output to separate output files?
Below is my solution thus far, first, the bash script to find all unconverted files, and second, the python script to extract the worksheet names.
#!/bin/bash
# Paths
RAW_DATA_DIR_EBS=/mnt/data/shared/raw
CSV_DATA_DIR_EBS=/mnt/data/shared/csv
ETL_CONVERT_DIR=$(pwd)/$(dirname "$0")
function check_conversion {
# If filesize = 0 bytes, exit with error
filesize=$(wc -b $1)
if [ $filesize == 0 ]; then
echo "ERROR: conversion failed. Empty output file: $1"
exit 1
fi;
}
# For each data source directory
for source in $RAW_DATA_DIR_EBS/; do
dir=$(basename $source)
# Create output dir if absent
mkdir -p $CSV_DATA_DIR_EBS/$dir
# For each file for that source
for fin in $RAW_DATA_DIR_EBS/$dir/*.xlsx; do
# Get sheet names and store in array
echo "Obtaining worksheet names from $fin"
sheets=$(python $ETL_CONVERT_DIR/check_sheet_names.py -x $fin | tr -d '[]' | tr -d [:punct:])
IFS="," read -r -a sheets_array <<< "$sheets"
if [ ${#sheets_array[@]} == 1 ]; then
# Just one sheet
fout=$CSV_DATA_DIR_EBS/$dir/$(basename "$fin" .xlsx).csv
if [ ! -e $fout ]; then
echo "Converting worksheet in $fin ..."
in2csv -e utf-8 $fin > $fout;
check_conversion $fout
gzip $fout
else
echo "Spreadsheet $fin already converted:" $fout;
fi;
else
# Multiple sheets
for sheet in $sheets; do
fout=$CSV_DATA_DIR_EBS/$dir/$(basename "$fin" .xlsx)__sheet_"${sheet}".csv
if [ ! -e $fout ]; then
echo "Converting worksheet $sheet in $fin ..."
in2csv -e utf-8 --sheet $sheet $fin > $fout;
check_conversion $fout;
gzip $fout
else
echo "Spreadsheet $fin already converted:" $fout
fi;
done;
fi;
done;
done;
exit 0
#!/usr/bin/env python
''' check_sheet_names.py
Get names of worksheets contained in an .xlsx spreadsheet
'''
import xlrd
import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='CLI')
parser.add_argument('--xlsx', '-x', type=str, required=True)
args = parser.parse_args()
xls = xlrd.open_workbook(args.xlsx, on_demand=True)
print(xls.sheet_names())
xls.release_resources()
del xls
Upvotes: 1
Views: 1772
Reputation: 580
The -n flag will provide the names of the different sheets in an xlsx doc.
in2csv -n filename.xlsx
Upvotes: 3