James

Reputation: 41

Using Python and pyxlsb to look up data from an xlsb file

Goals: I have a workbook with several sheets. Every sheet has the same format: time in seconds in column 1 and headers in the top row. I want code that looks up a particular time range and column header in a sheet, finds the average of that data, and writes it to a CSV.

Problem: The code is incredibly slow; is there a better method? Sheets are 30,000 rows by 300 columns. The code below only works with a modified, reduced data set.

import pandas as pd
from pyxlsb import open_workbook
import datetime

print("Start", datetime.datetime.now())
# Define the paths
csv_file = r"L:\Projects\P1563 V4\Data\Summary\ANR Sweeps\Python\ANR Sweeps_Python_simple.csv"

# Read the CSV file (without headers)
csv_data = pd.read_csv(csv_file, encoding='ISO-8859-1', header=None)

# Extract parameters from the CSV file using numeric indexing
xlsb_path = csv_data.iloc[2, 0]  # Cell A3
xlsb_file = csv_data.iloc[2, 1]  # Cell B3
sheet_name = csv_data.iloc[0, 4]  # Cell E1
column_header = csv_data.iloc[1, 4]  # Cell E2
start_time = float(csv_data.iloc[2, 2])  # Cell C3
end_time = float(csv_data.iloc[2, 3])  # Cell D3

print("CSV open", datetime.datetime.now())


# Full path to the .xlsb file
xlsb_full_path = f"{xlsb_path}{xlsb_file}.xlsb"

with open_workbook(xlsb_full_path) as wb:
    with wb.get_sheet(sheet_name) as sheet:
        # Use a single row iterator: each call to sheet.rows() returns a
        # fresh generator, so the original next(sheet.rows()) skip was a no-op
        rows = sheet.rows()

        # Get the first row (assuming headers are in the first row)
        headers = next(rows)

        # Create a mapping of header name to its index
        header_map = {cell.v: idx for idx, cell in enumerate(headers)}

        # Fail fast if the desired column header is missing
        if column_header not in header_map:
            raise ValueError(f"Header '{column_header}' not found.")

        data_col_idx = header_map[column_header]
        time_col_idx = 0
        print(f"Header '{column_header}' is at index: {data_col_idx}")

        # Initialize variables for sum and count
        total = 0
        count = 0

        # Iterate through the remaining rows (the header was already consumed)
        for row in rows:
            try:
                # Get the time and data values
                time_value = row[time_col_idx].v
                data_value = row[data_col_idx].v
                # Only average numeric rows within the requested time range
                if isinstance(time_value, float) and isinstance(data_value, (int, float)):
                    if start_time <= time_value <= end_time:
                        total += data_value
                        count += 1
            except IndexError:
                # Skip rows with missing data
                continue

        # Calculate the average
        if count > 0:
            average = total / count
            print(f"Average of '{column_header}' between {start_time} and {end_time}: {average}")
        else:
            print(f"No valid data found in column '{column_header}' between {start_time} and {end_time}.")

Upvotes: 1

Views: 63

Answers (1)

J_H

Reputation: 20550

Your code looks fine.

                    if start_time <= time_value <= end_time:

That's a perfectly nice boolean expression, but there may be an opportunity to optimize it. If you know the 30,000 rows are in chronological order, then you can bail early once you see time values past end_time. For example, if you put this code in a function you could do an early return.
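
A minimal sketch of that early exit, assuming the rows really are sorted by time (the function and variable names here are illustrative, not from the original code):

def average_in_range(rows, time_idx, data_idx, start_time, end_time):
    total = 0.0
    count = 0
    for row in rows:
        time_value = row[time_idx].v
        if not isinstance(time_value, float):
            continue  # skip the header row or blank cells
        if time_value > end_time:
            break  # rows are chronological, so nothing later can match
        if time_value >= start_time:
            total += row[data_idx].v
            count += 1
    return total / count if count else None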

Skipping the initial rows takes O(n) time, linear in the number of skipped rows. You might consider using binary search so your app takes O(log n) time to find the starting point. However, the library your app calls into will probably be stuck reading all the skipped rows anyway, for little net gain.
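
For completeness, a sketch of the binary-search idea using the standard library's bisect module; it only pays off if the time column is already in memory, e.g. loaded once and then queried repeatedly (the sample lists below are made up):

from bisect import bisect_left, bisect_right

times = [0.0, 0.5, 1.0, 1.5, 2.0]       # time column, sorted ascending
values = [10.0, 12.0, 11.0, 13.0, 9.0]  # matching data column

lo = bisect_left(times, 0.5)   # first index with times[lo] >= start_time
hi = bisect_right(times, 1.5)  # first index past the last times[i] <= end_time
window = values[lo:hi]
average = sum(window) / len(window) if window else None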

repeated queries

junk columns

If you query the same sheet several times, with several time ranges, then it would be worth your while to put your data source on a diet and to better organize it. Right now every row has a few hundred junk columns appended at the end. Produce and save "small" worksheets containing just the columns that interest you.
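
One hedged way to do that trimming, assuming a pandas version that supports the pyxlsb engine (the path, sheet, and column names below are hypothetical):

import pandas as pd

# One-time conversion: keep only the time column and the few columns you query
df = pd.read_excel(
    r"L:\path\to\workbook.xlsb",   # hypothetical path
    sheet_name="Sheet1",           # hypothetical sheet
    engine="pyxlsb",
    usecols=["Time", "SensorA"],   # hypothetical headers; drops the junk columns
)
df.to_csv("sheet1_trimmed.csv", index=False)  # small file, fast to re-read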

RDBMS

The best thing you could do is shovel the relevant columns into an indexed table, perhaps using Postgres, or SQLite, which ships as part of Python's standard library.

The index lets you ignore most of the stored rows, focusing just on the relevant time interval. And the database offers COUNT(), SUM(), and AVG() functions.
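
A minimal sqlite3 sketch of that approach; the table and column names are made up, and records stands in for the (time, value) pairs pulled from the sheet:

import sqlite3

records = [(0.0, 10.0), (0.5, 12.0), (1.0, 11.0)]  # (time, value) pairs from the sheet

con = sqlite3.connect("sweeps.db")
con.execute("CREATE TABLE IF NOT EXISTS sweep (t REAL, value REAL)")
con.execute("CREATE INDEX IF NOT EXISTS sweep_t_idx ON sweep (t)")
con.executemany("INSERT INTO sweep VALUES (?, ?)", records)
con.commit()

row = con.execute(
    "SELECT AVG(value), COUNT(*) FROM sweep WHERE t BETWEEN ? AND ?",
    (0.0, 1.0),
).fetchone()
print(row)  # (average, count) over the requested time interval

With the index on t, each query touches only the rows in the requested interval instead of scanning the whole table.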

Upvotes: 0
