g.humpkins
g.humpkins

Reputation: 341

Reading and Processing Large CSVs with Python

I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.

Input: I have CSV data that looks like

id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500

Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.

drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0

I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.

import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])

            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return(drug_info)

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()

I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone could take a look and provide me some suggestions for how to solve this problem.

Upvotes: 0

Views: 51

Answers (2)

Gowdhaman008
Gowdhaman008

Reputation: 1323

IF you can use pandas, Please try the following. Pandas reads your file and store them in dataframe. It is much faster than our manual file processing using iterator.

import pandas as pd


df = pd.read_csv('sample_data.txt')

columns = ['id','drug_name','drug_cost']


df1 = df[columns]
gd = df1.groupby('drug_name')
cnt= gd.count()
s=gd.sum()

out = s.join(cnt,lsuffix='x')
out['total_cost']=out['drug_costx']
out['num_prescriber']=out['drug_cost']
fout = out[['num_prescriber','total_cost']]

fout.to_csv('out_data.csv')

I am getting the following output.

drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000

Hope this helps.

Upvotes: 1

blhsing
blhsing

Reputation: 106553

Lists are inefficient at testing membership especially when the length is in the thousands as it costs O(n). Use sets instead to store your prescriber ids, which will reduce the cost of membership tests to O(1).

        if prescribed_drug not in drug_info:
            drug_info[prescribed_drug] = ({prescriber_id}, [prescribed_drug_cost])
        else:
            if prescriber_id not in drug_info[prescribed_drug][0]:
                drug_info[prescribed_drug][0].add(prescriber_id)
                drug_info[prescribed_drug][1].append(prescribed_drug_cost)
            else:
                drug_info[prescribed_drug][1].append(prescribed_drug_cost)

Upvotes: 0

Related Questions