Reputation: 341
I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.
Input: I have CSV data that looks like
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
Output: from this I need to output each drug, the total cost summed over all prescriptions, and the number of unique prescribers.
drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0
I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1 GB) input, my code does not terminate in a reasonable amount of time.
import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])
            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return drug_info

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()
I am having trouble figuring out how to refactor this to handle the larger input, and I was hoping someone could take a look and provide some suggestions for how to solve this problem.
Upvotes: 0
Views: 51
Reputation: 1323
If you can use pandas, please try the following. Pandas reads your file and stores it in a DataFrame, which is much faster than iterating over the rows manually.
import pandas as pd

df = pd.read_csv('sample_data.txt')

# Group by drug, then count distinct prescriber ids and sum costs per drug.
gd = df.groupby('drug_name')
out = pd.DataFrame({
    'num_prescriber': gd['id'].nunique(),   # unique prescribers per drug
    'total_cost': gd['drug_cost'].sum(),    # total cost per drug
})
out.to_csv('out_data.csv')
I am getting the following output.
drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000
Hope this helps.
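If the file ever grows past what fits comfortably in memory, a possible variation (not part of the code above; the file name 'sample_data.txt' and the chunk size are assumptions) is to let pandas read the CSV in chunks and accumulate the per-drug totals and prescriber-id sets as you go:
import pandas as pd

# Sketch only: stream the CSV in chunks so the whole file never sits in
# memory at once; the chunk size of 100_000 rows is an arbitrary choice.
totals = {}          # drug_name -> running cost total
prescribers = {}     # drug_name -> set of prescriber ids seen so far

for chunk in pd.read_csv('sample_data.txt', chunksize=100_000):
    for drug, cost in chunk.groupby('drug_name')['drug_cost'].sum().items():
        totals[drug] = totals.get(drug, 0.0) + cost
    for drug, ids in chunk.groupby('drug_name')['id'].unique().items():
        prescribers.setdefault(drug, set()).update(ids)

out = pd.DataFrame({
    'num_prescriber': {drug: len(ids) for drug, ids in prescribers.items()},
    'total_cost': totals,
})
out.index.name = 'drug_name'
out.to_csv('out_data.csv')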
Upvotes: 1
Reputation: 106553
Lists are inefficient at testing membership, especially when the length is in the thousands, since each test costs O(n). Use sets instead to store your prescriber ids, which reduces the cost of a membership test to O(1) on average.
if prescribed_drug not in drug_info:
    drug_info[prescribed_drug] = ({prescriber_id}, [prescribed_drug_cost])
else:
    if prescriber_id not in drug_info[prescribed_drug][0]:
        drug_info[prescribed_drug][0].add(prescriber_id)
        drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    else:
        drug_info[prescribed_drug][1].append(prescribed_drug_cost)
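Since adding an element that is already in a set is a no-op, the explicit membership test can be dropped entirely. Here is a minimal sketch of how read_file could look with that change applied (the use of collections.defaultdict is my own convenience, not part of the suggestion above):
import csv
from collections import defaultdict

def read_file(data):
    # drug -> (set of prescriber ids, list of costs)
    drug_info = defaultdict(lambda: (set(), []))
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)                      # skip the header row
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])
            ids, costs = drug_info[prescribed_drug]
            ids.add(prescriber_id)         # duplicate ids are ignored automatically
            costs.append(prescribed_drug_cost)
    return drug_info
This keeps the same (ids, costs) tuple layout, so the existing write_file continues to work unchanged.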
Upvotes: 0