user3089324

Reputation: 29

I/O efficiency in Python

I am writing a program that matches rows from a 'Traps' Excel sheet against rows from an 'Incidents' Excel sheet (on node name, event type and date) and writes one new Excel file per incident containing the matching rows.

I have written the script and everything works fine; however, the computational time is huge. In an hour it has processed just 200 rows from the first sheet, writing 200 different files.

I was wondering if there is a way to save the matches differently, as I am going to use them later on. Is there any way to save them in a matrix or something similar?

import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime
# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")
# select the working sheet, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Incidents = book_1.sheet_by_name('Sheet1')
# select the working sheet, either by name or by index
Features_Numbers = book_3.sheet_by_name('Sheet1')
#return the total number of rows for the traps sheet
Total_Number_of_Rows_Traps = Traps.nrows
# return the total number of rows for the incident sheet
Total_Number_of_Rows_Incidents = Incidents.nrows
# open a file to write down the numbers of incidents with no match
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')

# For loop to iterate for all the row for the incident sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    # Store content for the comparable cell for incident sheet
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # Convert Excel date type into python type
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    # extract the year, month and day
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    # parse the date string into a datetime object
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # extract the incident number
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
    # Create a workbook for the selected incident
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    # Create sheet name for the created workbook
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # insert the first row that contains the features
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
    Insert_Row_to_Incident_Sheet = 0

# For loop to iterate for all the row for the traps sheet
for Rows_Traps in range(Total_Number_of_Rows_Traps):

    # Store content for the comparable cell for traps sheet
    Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
    # Store content for the comparable cell for traps sheet
    Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
    # extract the date string temporarily
    Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
    # parse the date string into a datetime object
    Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')

    # If the content matches partially or full
    if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
            str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
            len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
            str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
            len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
            Traps_Content_Date <= Incidents_Content_Date:
        # counter for writing inside the new incident sheet
        Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
        # Write the Incident information
        Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
        # Write the Traps information
        Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))

Incident_Name_Book.close()

Thanks

Upvotes: 0

Views: 140

Answers (3)

RaJa

Reputation: 1567

I would suggest that you use pandas. This module provides a huge number of functions for comparing datasets. It also has very fast import/export routines for Excel files.

IMHO you should use the merge function and provide the arguments how='inner' and on=[list of your columns to compare]. That will create a new dataset containing only the rows that occur in both tables (having the same values in the defined columns). You can then export this new dataset to your Excel file.
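For illustration, a minimal sketch of that approach; the file paths and the key column names are assumptions you would replace with the real headers from your sheets (and since the key columns have different names in the two sheets here, the sketch uses left_on/right_on instead of on):

import pandas as pd

# Read each sheet into a DataFrame in one go (much faster than per-cell reads).
incidents = pd.read_excel('Incidents.xlsx', sheet_name='Sheet1')
traps = pd.read_excel('Traps.xlsx', sheet_name='Sheet1')

# Inner join: keep only row pairs whose key columns match in both tables.
# 'Affected Resource', 'Node Name' etc. are hypothetical column headers.
matched = incidents.merge(traps, how='inner',
                          left_on=['Affected Resource', 'Product Type'],
                          right_on=['Node Name', 'Event Type'])

# Export all matched rows to a single Excel file in one go.
matched.to_excel('Matched.xlsx', index=False)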

Upvotes: 1

Aaron Digulla

Reputation: 328556

I don't see how your code can work: the second loop uses variables which change for every row of the first loop, but the second loop isn't inside the first one.

That said, comparing files this way has a complexity of O(N*M), which means the runtime explodes quickly. In your case, you are trying to execute 54,000,000,000 (54 billion) loop iterations.

If you run into this kind of problem, the solution is always a three-step process:

  1. Transform the data to make it easier to process
  2. Put the data into efficient structures (sorted lists, dict)
  3. Search the data with the efficient structures

You have to find a way to get rid of the find(). Try to get rid of all the junk in the cells that you want to compare, so that you can use a plain equality test (==). Once you have that, you can put rows into a dict to find matches. Or you could load the data into a SQL database and use SQL queries (don't forget to add indexes!).
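As an illustration only, here is a minimal sketch of the dict idea against the question's sheets. The column indexes (3 for the trap node name, 47 for the incident's affected resources) come from the question's code, but the normalize() cleanup rule is an assumption:

from collections import defaultdict

def normalize(value):
    # Assumed cleanup: trim whitespace and compare case-insensitively.
    return str(value).strip().lower()

# Index every trap row by its normalized node name once: O(M).
traps_by_node = defaultdict(list)
for r in range(Traps.nrows):
    traps_by_node[normalize(Traps.cell_value(r, 3))].append(r)

# Each incident then finds its candidate traps in O(1) instead of O(M).
for r in range(Incidents.nrows):
    key = normalize(Incidents.cell_value(r, 47))
    for trap_row in traps_by_node.get(key, []):
        pass  # process the match here

This only works once the keys compare equal exactly, which is why cleaning the cells first matters.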

One last trick is to use sorted lists. If you can sort the data from both sheets by the same key, then you can use two lists:

  1. Sort the data from the two sheets into two lists
  2. Use two row counters (one per list)
  3. If the current item from the first list is smaller than the current item from the second list, there is no match; advance the first row counter.
  4. If the current item from the first list is bigger than the current item from the second list, there is no match; advance the second row counter.
  5. If the items are equal, you have a match: process it and advance both counters.

This allows you to process all the items in one go.
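A minimal sketch of that merge walk; the lists and the handle_match() callback are hypothetical, and both lists are assumed sorted by the same comparable key:

def handle_match(a, b):
    print('match:', a, b)  # placeholder: process the match here

list_a = sorted(['node1', 'node2', 'node5'])  # hypothetical sorted keys
list_b = sorted(['node2', 'node3', 'node5'])

i, j = 0, 0
while i < len(list_a) and j < len(list_b):
    if list_a[i] < list_b[j]:
        i += 1                       # no match: advance the first counter
    elif list_a[i] > list_b[j]:
        j += 1                       # no match: advance the second counter
    else:
        handle_match(list_a[i], list_b[j])
        i += 1                       # match: advance both counters
        j += 1

Every iteration advances at least one counter, so the pass over both lists is O(N + M) instead of O(N*M).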

Upvotes: 3

knitti

Reputation: 7033

What you're doing is seeking/reading a little bit of data for each cell. This is very inefficient.

Try reading all the information in one go into basic Python data structures (lists, dicts, etc.), do your comparisons/operations on this data set in memory, and write all results in one go. If not all of the data fits into memory, partition it into sub-tasks.

Even if you had to read the data set 10 times to extract a tenth of the data each time, it would likely still be hugely faster than reading each cell independently.
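For example, a minimal sketch of reading whole sheets into memory using the xlrd objects from the question; all further matching then happens on plain in-memory lists:

# One pass per sheet: materialize every row as a list of cell values.
incident_rows = [Incidents.row_values(r) for r in range(Incidents.nrows)]
trap_rows = [Traps.row_values(r) for r in range(Traps.nrows)]

# The Excel files are no longer touched until the final write.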

Upvotes: 3
