Reputation: 29
I am writing a program that:
I have written the script and everything works fine. however, the computational time is HUGE. For an hour, it has done just 200 rows from the first sheet, resulting in writing 200 different files.
I was wondering if there is a way to save the matching in a different way as I am going to use them later on? Is there any way to save in a matrix or something?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime
# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")
# select the working sheet, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Incidents = book_1.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Features_Numbers = book_3.sheet_by_name('Sheet1')
#return the total number of rows for the traps sheet
Total_Number_of_Rows_Traps = Traps.nrows
# return the total number of rows for the incident sheet
Total_Number_of_Rows_Incidents = Incidents.nrows
# open a file two write down the non matching incident's numbers
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')
# For loop to iterate for all the row for the incident sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
# Store content for the comparable cell for incident sheet
Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
# Store content for the comparable cell for incident sheet
Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
# Convert Excel date type into python type
Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
# extract the year, month and day
Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
# Store content for the comparable cell for incident sheet
Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
# extract the incident number
Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
# Create a workbook for the selected incident
Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
# Create sheet name for the created workbook
Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
# insert the first row that contains the features
Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
Insert_Row_to_Incident_Sheet = 0
# For loop to iterate for all the row for the traps sheet
for Rows_Traps in range(Total_Number_of_Rows_Traps):
# Store content for the comparable cell for traps sheet
Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
# Store content for the comparable cell for traps sheet
Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
# extract date temporally
Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
# Store content for the comparable cell for traps sheet
Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')
# If the content matches partially or full
if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
Traps_Content_Date <= Incidents_Content_Date:
# counter for writing inside the new incident sheet
Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
# Write the Incident information
Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
# Write the Traps information
Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))
Incident_Name_Book.close()
Thanks
Upvotes: 0
Views: 140
Reputation: 1567
I would suggest that you use pandas
. This module provides a huge amount of functions to compare datasets. It also has a very fast import/export algorithms for excel files.
IMHO you should use the merge function and provide the arguments how = 'inner'
and on = [list of your columns to compare]
. That will create a new dataset with only such rows that occur in both tables (having the same values in the defined colums). This new dataset you can export to your excel file.
Upvotes: 1
Reputation: 328556
I don't see how your code can work; the second loop works on variables which change for every row in the first loop but the second loop isn't inside of the first one.
That said, comparing files in this way has a complexity of O(N*M) which means that the runtime explodes quickly. In your case you try to execute 54'000'000'000 (54 billion) loops.
If you run into these kind of problems, the solution is always a three step process:
dict
)You have to find a way to get rid of the find()
. Try to get rid of all the junk in the cells that you want to compare so that you could use =
. When you have this, you can put rows into a dict
to find matches. Or you could load it into a SQL database and use SQL queries (don't forget to add indexes!)
One last trick is to use sorted lists. If you can sort the data in the same way, then you can use two lists:
This allows you to process all the items in one go.
Upvotes: 3
Reputation: 7033
What your doing is seeking/reading a litte bit of data for each cell. This is very inefficient.
Try reading all information in one go into an as basic as sensible python data structure (lists, dicts etc.) and make your comparisons/operations on this data set in memory and write all results in one go. If not all data fits into memory, try to partition it into sub-tasks.
Having to read the data set 10 times, to extract a tenth of data each time will likely still be hugely faster than reading each cell independently.
Upvotes: 3