Reputation: 341
I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.
Input: I have CSV data that looks like
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
Output: from this I need to output each drug, the total cost summed over all prescriptions, and the number of unique prescribers.
drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0
I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1 GB) input, my code does not terminate in a reasonable amount of time.
import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])
            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return drug_info

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()
I am having trouble figuring out how to refactor this to handle the larger input, and I was hoping someone could take a look and provide some suggestions for how to solve this problem.
Upvotes: 0
Views: 51
Reputation: 1323
If you can use pandas, please try the following. Pandas reads your file and stores it in a DataFrame, which is much faster than iterating over the rows manually.
import pandas as pd

df = pd.read_csv('sample_data.txt')

# Group by drug, then count distinct prescriber ids and sum costs per drug.
gd = df.groupby('drug_name')
out = pd.DataFrame({
    'num_prescriber': gd['id'].nunique(),   # unique prescribers per drug
    'total_cost': gd['drug_cost'].sum(),    # total cost per drug
})
out.to_csv('out_data.csv')
I am getting the following output.
drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000
Hope this helps.
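If the file ever grows past what fits comfortably in memory, a possible variation (not part of the code above; the file name 'sample_data.txt' and the chunk size are assumptions) is to let pandas read the CSV in chunks and accumulate the per-drug totals and prescriber-id sets as you go:
import pandas as pd

# Sketch only: stream the CSV in chunks so the whole file never sits in
# memory at once; the chunk size of 100_000 rows is an arbitrary choice.
totals = {}          # drug_name -> running cost total
prescribers = {}     # drug_name -> set of prescriber ids seen so far

for chunk in pd.read_csv('sample_data.txt', chunksize=100_000):
    for drug, cost in chunk.groupby('drug_name')['drug_cost'].sum().items():
        totals[drug] = totals.get(drug, 0.0) + cost
    for drug, ids in chunk.groupby('drug_name')['id'].unique().items():
        prescribers.setdefault(drug, set()).update(ids)

out = pd.DataFrame({
    'num_prescriber': {drug: len(ids) for drug, ids in prescribers.items()},
    'total_cost': totals,
})
out.index.name = 'drug_name'
out.to_csv('out_data.csv')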
Upvotes: 1
Reputation: 106553
Lists are inefficient at testing membership, especially when the length is in the thousands, since each test costs O(n). Use sets instead to store your prescriber ids, which reduces the cost of a membership test to O(1) on average.
if prescribed_drug not in drug_info:
    drug_info[prescribed_drug] = ({prescriber_id}, [prescribed_drug_cost])
else:
    if prescriber_id not in drug_info[prescribed_drug][0]:
        drug_info[prescribed_drug][0].add(prescriber_id)
        drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    else:
        drug_info[prescribed_drug][1].append(prescribed_drug_cost)
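Since adding an element that is already in a set is a no-op, the explicit membership test can be dropped entirely. Here is a minimal sketch of how read_file could look with that change applied (the use of collections.defaultdict is my own convenience, not part of the suggestion above):
import csv
from collections import defaultdict

def read_file(data):
    # drug -> (set of prescriber ids, list of costs)
    drug_info = defaultdict(lambda: (set(), []))
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)                      # skip the header row
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])
            ids, costs = drug_info[prescribed_drug]
            ids.add(prescriber_id)         # duplicate ids are ignored automatically
            costs.append(prescribed_drug_cost)
    return drug_info
This keeps the same (ids, costs) tuple layout, so the existing write_file continues to work unchanged.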
Upvotes: 0