yunque
yunque

Reputation: 675

Extract worksheet names from a .xlsx spreadsheet, efficiently

I am converting a set of ~200 .xlsx files, each with multiple worksheets, into .csv. in2csv allows extracting a single sheet using the -s switch, but first I need to obtain the names of all worksheets in an .xlsx spreadsheet from the command line.

I've tried python's xlrd package but it takes minutes for a single file >100MB, because it requires loading the entire workbook in order to read the worksheet wrappers.

I'm aware of gnumeric ssconvert (the -S switch does the trick) but would rather not install a 600MB package for just one function.

Is there a way to extract the worksheet names without loading the entire spreadsheet/workbook? If not, how can I separately convert each worksheet and output to separate output files?

Below is my solution thus far, first, the bash script to find all unconverted files, and second, the python script to extract the worksheet names.

#!/bin/bash

# Paths
RAW_DATA_DIR_EBS=/mnt/data/shared/raw
CSV_DATA_DIR_EBS=/mnt/data/shared/csv
ETL_CONVERT_DIR=$(pwd)/$(dirname "$0")

function check_conversion {
    # If filesize = 0 bytes, exit with error
    filesize=$(wc -b $1)
    if [ $filesize == 0 ]; then
        echo "ERROR: conversion failed. Empty output file: $1"
        exit 1
    fi;
}

# For each data source directory
for source in $RAW_DATA_DIR_EBS/; do
    dir=$(basename $source)    

    # Create output dir if absent
    mkdir -p $CSV_DATA_DIR_EBS/$dir

    # For each file for that source
    for fin in $RAW_DATA_DIR_EBS/$dir/*.xlsx; do

        # Get sheet names and store in array
        echo "Obtaining worksheet names from $fin"
        sheets=$(python $ETL_CONVERT_DIR/check_sheet_names.py -x $fin | tr -d '[]' | tr -d [:punct:])
        IFS="," read -r -a sheets_array <<< "$sheets"   

        if [ ${#sheets_array[@]} == 1 ]; then

            # Just one sheet
            fout=$CSV_DATA_DIR_EBS/$dir/$(basename "$fin" .xlsx).csv
            if [ ! -e $fout ]; then
                echo "Converting worksheet in $fin ..."
                in2csv -e utf-8 $fin > $fout;
                check_conversion $fout
                gzip $fout
            else
                echo "Spreadsheet $fin already converted:" $fout;
            fi;

        else

            # Multiple sheets
            for sheet in $sheets; do
                fout=$CSV_DATA_DIR_EBS/$dir/$(basename "$fin" .xlsx)__sheet_"${sheet}".csv
                if [ ! -e $fout ]; then
                    echo "Converting worksheet $sheet in $fin ..."
                    in2csv -e utf-8 --sheet $sheet $fin > $fout;
                    check_conversion $fout;
                    gzip $fout
                else
                    echo "Spreadsheet $fin already converted:" $fout
                fi;
            done;
        fi;
    done;

done;
exit 0
#!/usr/bin/env python
''' check_sheet_names.py

Get names of worksheets contained in an .xlsx spreadsheet
'''

import xlrd
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='CLI')
    parser.add_argument('--xlsx', '-x', type=str, required=True)
    args = parser.parse_args()

    xls = xlrd.open_workbook(args.xlsx, on_demand=True)
    print(xls.sheet_names())
    xls.release_resources()
    del xls

Upvotes: 1

Views: 1772

Answers (1)

BigToach
BigToach

Reputation: 580

The -n flag will provide the names of the different sheets in an xlsx doc.

in2csv -n filename.xlsx

Upvotes: 3

Related Questions