doctorer

Reputation: 1772

Speeding up Python file handling for a huge dataset

I have a large dataset stored as a 17 GB csv file (fileData), which contains a variable number of records (up to approximately 30,000) for each customer_id. I am trying to search for specific customers (listed in fileSelection - around 1,500 out of a total of 90,000) and copy the records for each of these customers into a separate csv file (fileOutput).

I am very new to Python, but am using it because VBA and MATLAB (which I am more familiar with) can't handle the file size. (I am using Aptana Studio to write the code, but running the Python directly from the command line for speed. Running 64-bit Windows 7.)

The code I have written extracts some of the customers, but has two problems: 1) it fails to find most of the customers in the large dataset (I believe they are all in the dataset, but cannot be completely sure), and 2) it is VERY slow. Any way to speed up the code would be appreciated, including code that can better utilise a 16-core PC.

Here is the code:

def main():

    # Initialisation : 

    #  - identify columns in selection file
    #
    fS = open (fileSelection,"r")
    if fS.mode == "r":
        header = fS.readline()
        selheaderlist = header.split(",")
        custkey =   selheaderlist.index('CUSTOMER_KEY')

    #
    # Identify columns in dataset file
    fileData = path2+file_data
    fD = open (fileData,"r")
    if fD.mode == "r":
        header = fD.readline()
        dataheaderlist = header.split(",")
        custID =   dataheaderlist.index('CUSTOMER_ID')
    fD.close()

    # For each customer in the selection file
    customercount=1
    for sr in fS:
        # Find customer key and locate it in customer ID field in dataset  
        selrecord = sr.split(",")
        requiredcustomer = selrecord[custkey]

        #Look for required customer in dataset
        found = 0
        fD = open (fileData,"r")
        if fD.mode == "r":
            while found == 0:
                dr = fD.readline()
                if not dr: break
                datrecord = dr.split(",")
                if datrecord[custID] == requiredcustomer:
                    found = 1

                    # Open outputfile
                    fileOutput= path3+file_out_root + str(requiredcustomer)+ ".csv"
                    fO=open(fileOutput,"w+")
                    fO.write(str(header))

                    #copy all records for required customer number
                    while datrecord[custID] == requiredcustomer:
                        fO.write(str(dr))
                        dr = fD.readline()
                        datrecord = dr.split(",")
                    #Close Output file          
                    fO.close()           

            if found == 1:
                print ("Customer Count "+str(customercount)+ "  Customer ID"+str(requiredcustomer)+" copied.  ")
                customercount = customercount+1
            else:
                print("Customer ID"+str(requiredcustomer)+" not found in dataset")
                fL.write (str(requiredcustomer)+","+"NOT FOUND")
            fD.close()

    fS.close()


It has taken a few days to extract a couple of hundred customers, but has failed to find many more.

Sample Output

Thanks @Paul Cornelius. This is much more efficient. I have adopted your approach, also using the csv handling suggested by @Bernardo:

# Import Modules

import csv


def main():

    # Initialisation : 


    fileSelection = path1+file_selection
    fileData = path2+file_data


    # Step through selection file and create dictionary with required ID's as keys, and empty objects
    with open(fileSelection,'rb') as csvfile:
        selected_IDs = csv.reader(csvfile)
        ID_dict = {}
        for row in selected_IDs:
            ID_dict.update({row[1]:[]})

    # step through data file: for selected customer ID's, append records to dictionary objects
    with open(fileData,'rb') as csvfile:
        dataset = csv.reader(csvfile)
        for row in dataset:
            if row[0] in ID_dict:
                ID_dict[row[0]].extend([row[1]+','+row[4]])

    # write all dictionary objects to csv files
    for row in ID_dict.keys():
        fileOutput = path3+file_out_root+row+'.csv'
        with open(fileOutput,'wb') as csvfile:
            output = csv.writer(csvfile, delimiter='\n')
            output.writerows([ID_dict[row]])

Upvotes: 3

Views: 2241

Answers (3)

BernardoGO

Reputation: 1856

Use the csv reader instead. Python has a good library for handling CSV files, so you don't need to do the splits yourself.

Check out the documentation: https://docs.python.org/2/library/csv.html

>>> import csv
>>> with open('eggs.csv', 'rb') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print ', '.join(row)
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

It should perform much better.
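For example, applied to the question's data file, the manual header.split(",") and index lookups could be replaced with something like this (a sketch only; the CUSTOMER_ID column and the variable names such as fileData and requiredcustomer come from the question's code):

    import csv

    # Sketch: let csv.reader handle the splitting (and any quoted fields) instead of str.split(",")
    with open(fileData, 'rb') as f:           # 'rb' as in the Python 2 csv docs linked above
        reader = csv.reader(f)
        header = next(reader)                 # first row is the header
        custID = header.index('CUSTOMER_ID')  # same column lookup as in the question
        for datrecord in reader:              # each row is already a list of fields
            if datrecord[custID] == requiredcustomer:
                pass                          # copy the record as in the question's code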

Upvotes: 3

Paul Cornelius

Reputation: 10906

The task is way too involved for a simple answer. But your approach is very inefficient because you have too many nested loops. Try making ONE pass through the list of customers, and for each build a "customer" object with any information that you need to use later. You put these in a dictionary; the keys are the different requiredcustomer variables and the values are the customer objects. If I were you, I would get this part to work first, before ever fooling around with the big file.

Now you step ONCE through the massive file of customer data, and each time you encounter a record whose datarecord[custID] field is in the dictionary, you append a line to the output file. You can use the relatively efficient in operator to test for membership in the dictionary.

No nested loops are necessary.

The code as you present it can't run since you write to some object named fL without ever opening it. Also, as Tim Pietzcker pointed out, you aren't closing your files since you don't actually call the close function.
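For reference, a minimal sketch of this single-pass approach (not part of the original code; it assumes the CUSTOMER_KEY/CUSTOMER_ID columns and the path variables from the question, and uses the csv module suggested in the other answer):

    import csv

    # Pass 1: one loop over the (small) selection file to build the dictionary of wanted IDs.
    customers = {}
    with open(fileSelection, 'rb') as sel:
        reader = csv.reader(sel)
        header = next(reader)
        key_col = header.index('CUSTOMER_KEY')
        for row in reader:
            customers[row[key_col]] = []        # key: customer ID, value: its records

    # Pass 2: one loop over the (huge) data file; membership tests on a dict are cheap.
    with open(fileData, 'rb') as data:
        reader = csv.reader(data)
        header = next(reader)
        id_col = header.index('CUSTOMER_ID')
        for row in reader:
            if row[id_col] in customers:
                customers[row[id_col]].append(row)

    # Finally, write one output file per selected customer, with the data-file header on top.
    for cust_id, rows in customers.items():
        with open(path3 + file_out_root + cust_id + '.csv', 'wb') as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)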

Upvotes: 2

OmerBA

Reputation: 842

Try using pandas if your machine can handle the size of the csv in memory.

If you are looking for out-of-core computation, take a look at dask (it provides a similar API).

In pandas, you can read only specific columns from a csv file, if you run into memory problems.
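For example (a sketch; CUSTOMER_ID is the column name from the question, and you would add whichever other columns you actually need):

import pandas as pd

# Read only the columns you need to keep memory usage down.
input_csv = pd.read_csv('path_to_csv', usecols=['CUSTOMER_ID'])  # add further columns as required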

Anyway, both pandas and dask use C bindings, which are significantly faster than pure Python.

In pandas, your code would look something like:

import pandas as pd

input_csv = pd.read_csv('path_to_csv')
records_for_interesting_customers = input_csv[input_csv.CUSTOMER_ID.isin(list_of_ids)]
records_for_interesting_customers.to_csv('output_path')
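If the file does not fit in memory, roughly the same thing in dask would look like this (a sketch only; the CUSTOMER_ID column and the output pattern are assumptions based on the question):

import dask.dataframe as dd

# dask reads the csv in chunks and only computes when the results are written out.
df = dd.read_csv('path_to_csv')
selected = df[df['CUSTOMER_ID'].isin(list_of_ids)]
selected.to_csv('output_path_*.csv')   # dask writes one file per partition, filling in the *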

Upvotes: 2
