Reputation: 41
Goals: I have a workbook with several sheets in it. Each sheet has time in seconds in column 1 and headers in the top row. I want code that looks up a particular time range and column header in a sheet, computes the average of that data, and writes it to a CSV.
Problem: The code is incredibly slow; is there a better method? The sheets are 30,000 rows by 300 columns. The code below only works with a modified, reduced data set.
import datetime

import pandas as pd
from pyxlsb import open_workbook

print("Start", datetime.datetime.now())

# Define the paths
csv_file = r"L:\Projects\P1563 V4\Data\Summary\ANR Sweeps\Python\ANR Sweeps_Python_simple.csv"

# Read the CSV file (without headers)
csv_data = pd.read_csv(csv_file, encoding='ISO-8859-1', header=None)

# Extract parameters from the CSV file using numeric indexing
xlsb_path = csv_data.iloc[2, 0]          # Cell A3
xlsb_file = csv_data.iloc[2, 1]          # Cell B3
sheet_name = csv_data.iloc[0, 4]         # Cell E1
column_header = csv_data.iloc[1, 4]      # Cell E2
start_time = float(csv_data.iloc[2, 2])  # Cell C3
end_time = float(csv_data.iloc[2, 3])    # Cell D3
print("CSV open", datetime.datetime.now())

# Full path to the .xlsb file
xlsb_full_path = f"{xlsb_path}{xlsb_file}.xlsb"

with open_workbook(xlsb_full_path) as wb:
    with wb.get_sheet(sheet_name) as sheet:
        # Each call to sheet.rows() creates a fresh generator, so keep
        # a single iterator: consuming the header row here means the
        # loop below starts at the first data row.
        rows = sheet.rows()

        # Headers are in the first row; map each header name to its index
        headers = next(rows)
        header_map = {cell.v: idx for idx, cell in enumerate(headers)}

        if column_header not in header_map:
            raise KeyError(f"Header '{column_header}' not found.")
        data_col_idx = header_map[column_header]
        time_col_idx = 0
        print(f"Header '{column_header}' is at index: {data_col_idx}")

        # Initialize variables for sum and count
        total = 0
        count = 0

        # Iterate through the data rows
        for row in rows:
            try:
                # Get the time and data values
                time_value = row[time_col_idx].v
                data_value = row[data_col_idx].v
                # Skip rows whose time cell is not numeric
                if isinstance(time_value, float):
                    if start_time <= time_value <= end_time:
                        total += data_value
                        count += 1
            except IndexError:
                # Skip rows with missing data
                continue

# Calculate the average
if count > 0:
    average = total / count
    print(f"Average of '{column_header}' between {start_time} and {end_time}: {average}")
else:
    print(f"No valid data found in column '{column_header}' between {start_time} and {end_time}.")
Upvotes: 1
Views: 63
Reputation: 20550
Your code looks fine.

    if start_time <= time_value <= end_time:

That's a perfectly nice boolean expression, but there may be an opportunity to optimize it. If you know the 30,000 rows are in chronological order, then you could bail out early once you see time values past the end_time. For example, if you put this code in a function you could do an early return.
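A minimal sketch of that early exit, assuming the rows really are sorted by time; the function name and parameters are illustrative, not part of the original code:

    def average_in_range(rows, time_col_idx, data_col_idx, start_time, end_time):
        """Average a column over [start_time, end_time]; assumes rows sorted by time."""
        total = 0.0
        count = 0
        for row in rows:
            time_value = row[time_col_idx].v
            if not isinstance(time_value, float):
                continue  # skip header or stray text rows
            if time_value > end_time:
                break  # rows are chronological: no later row can match
            if time_value >= start_time:
                total += row[data_col_idx].v
                count += 1
        return total / count if count else None

The break depends entirely on the chronological ordering; on unsorted data it would silently drop matching rows.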
Skipping the initial rows takes O(n) time, linear in the number of skipped rows. You might consider using binary search so your app takes O(log n) time to find the starting point. However, the library your app calls into will probably be stuck reading all the skipped rows anyway, for little net gain.
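For illustration, if the time column were already in memory as a sorted list, the standard library's bisect module finds the window boundaries in O(log n); a minimal sketch with made-up data:

    import bisect

    times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # hypothetical sorted time column
    values = [10, 12, 11, 13, 14, 15, 16]

    start_time, end_time = 1.0, 2.5
    lo = bisect.bisect_left(times, start_time)   # first index with times[lo] >= start_time
    hi = bisect.bisect_right(times, end_time)    # one past the last index <= end_time

    window = values[lo:hi]
    average = sum(window) / len(window)          # 13.25 for this data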
If you query the same sheet several times, with several time ranges, then it would be worth your while to put your data source on a diet and to better organize it. Right now every row has a few hundred junk columns appended at the end. Produce and save "small" worksheets containing just the columns that interest you.
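One way to produce such a slimmed-down copy, assuming pandas with the pyxlsb engine installed; the file name and column headers here are placeholders:

    import pandas as pd

    # Read only the time column and the columns you actually query
    # (engine="pyxlsb" requires the pyxlsb package)
    slim = pd.read_excel(
        "big_workbook.xlsb",
        sheet_name="Sheet1",
        engine="pyxlsb",
        usecols=["Time", "SensorA", "SensorB"],  # hypothetical headers
    )
    # Save a small, fast-to-reload copy
    slim.to_csv("sheet1_slim.csv", index=False)

You pay the full read cost once, then every later query works against the small file.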
The best thing you could do is shovel the relevant columns into an indexed table, perhaps using Postgres, or SQLite, which ships as part of Python's standard library.
The index lets you ignore most of the stored rows, focusing just on the relevant time interval. And the database offers COUNT(), SUM(), and AVG() functions.
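A minimal sketch of that approach with sqlite3; the table and column names and the inserted rows are placeholders:

    import sqlite3

    conn = sqlite3.connect("sweeps.db")
    conn.execute("CREATE TABLE IF NOT EXISTS readings (t REAL, sensor_a REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_readings_t ON readings (t)")

    # Load rows once (e.g. from the slim CSV above), then query many times
    conn.executemany(
        "INSERT INTO readings (t, sensor_a) VALUES (?, ?)",
        [(0.0, 10.0), (0.5, 12.0), (1.0, 11.0)],  # placeholder data
    )
    conn.commit()

    row = conn.execute(
        "SELECT COUNT(*), SUM(sensor_a), AVG(sensor_a)"
        " FROM readings WHERE t BETWEEN ? AND ?",
        (0.0, 1.0),
    ).fetchone()
    print(row)  # (3, 33.0, 11.0) for the placeholder data

With the index on t, each BETWEEN query touches only the rows in the requested interval instead of scanning all 30,000.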
Upvotes: 0