Reputation: 1546
What is the fastest way to process each line of a csv and write each processed line to a new csv? Is there a way to use the least memory while also being the fastest? Please see the following code. It requests a csv from an API, but it takes a very long time to get through the for loop I commented, and I think it is using all of the memory on my server.
import csv
import requests
from pandas import DataFrame

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def cleanDict(combinedDict):
    if combinedDict.get('a_id') is not None:
        combinedDict['a_id'] = int(float(combinedDict['a_id']))
        combinedDict['unique_a_id'] = '1_a_' + str(combinedDict['a_id'])
    if combinedDict.get('i_id') is not None:
        combinedDict['i_id'] = int(float(combinedDict['i_id']))
        combinedDict['unique_i_id'] = '1_i_' + str(combinedDict['i_id'])
    if combinedDict.get('pm') is not None:
        combinedDict['pm'] = "{0:.10f}".format(float(combinedDict['pm']))
    if combinedDict.get('s') is not None:
        combinedDict['s'] = "{0:.10f}".format(float(combinedDict['s']))
    return combinedDict

reportResult = requests.get(api, headers=header)
csvReader = csv.reader(utf_8_encoder(reportResult.text))
reportData = []
# this for loop takes a long time
for row in csvReader:
    combinedDict = dict(zip(fields, row))
    combinedDict = cleanDict(combinedDict)
    reportData.append(combinedDict)
reportDF = DataFrame(reportData, columns=fields)
reportDF.to_csv('report.csv', sep=',', header=False, index=False)
When I run the Python memory profiler, why does the line with the for loop show a memory increment? Is the for loop itself holding on to something in memory, or is my utf-8 converter messing something up?
Line # Mem usage Increment Line Contents
================================================
162 1869.254 MiB 1205.824 MiB for row in csvReader:
163 #print row
164 1869.254 MiB 0.000 MiB combinedDict = dict(zip(fields, row))
When I put the "@profile" symbol on the utf_8-encoder function as well, I see the memory on the above for loop disappeared:
163 for row in csvReader:
But now the memory shows up on the converter's for loop (I didn't let it run as long as last time, so it only got to 56 MB before I hit Ctrl+C):
Line # Mem usage Increment Line Contents
================================================
154 663.430 MiB 0.000 MiB @profile
155 def utf_8_encoder(unicode_csv_data):
156 722.496 MiB 59.066 MiB for line in unicode_csv_data:
157 722.496 MiB 0.000 MiB yield line.encode('utf-8')
Upvotes: 4
Views: 9064
Reputation: 1546
I found that using a dataframe to read the csv is much faster, and doesn't use so much memory that my server crashes:
from cStringIO import StringIO  # Python 2; use io.StringIO on Python 3
from pandas import read_csv

reportText = StringIO(reportResult.text)
reportDF = read_csv(reportText, sep=',', parse_dates=False)
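If memory is still a concern, read_csv can also stream the input in chunks instead of loading everything at once. A minimal sketch (the 10000-row chunk size is my arbitrary pick, not something from the original code):

for chunk in read_csv(StringIO(reportResult.text), sep=',', parse_dates=False, chunksize=10000):
    # Append each chunk to the output file as it is read.
    chunk.to_csv('report.csv', mode='a', header=False, index=False)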
Then I am able to process it using apply, for example:
def trimFloat(fl):
    # Format floats to 10 decimal places; pass None through untouched.
    if fl is not None:
        return "{0:.10f}".format(float(fl))
    else:
        return None

floatCols = ['a', 'b']
for col in floatCols:
    reportDF[col] = reportDF[col].apply(trimFloat)
def removePct(reportDF):
    # Strip the percent sign from column c (Python 2 str.translate).
    reportDF['c'] = reportDF['c'].apply(lambda x: x.translate(None, '%'))
    return reportDF
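To finish the round trip from the question (writing the cleaned frame out to a new csv), something like this should work:

reportDF = removePct(reportDF)
reportDF.to_csv('report.csv', sep=',', header=False, index=False)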
I suspect the major issue with the previous attempt had something to do with the UTF8 encoder.
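For what it's worth, here is my guess at the mechanism (an assumption, not something I verified): in Python 2, iterating over a unicode string yields one character at a time, so utf_8_encoder was handing csv.reader single encoded characters rather than whole lines.

text = u"col1,col2\n1,2\n"
print([piece for piece in text][:5])
# [u'c', u'o', u'l', u'1', u','] -- characters, not lines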
Upvotes: 2
Reputation: 3682
For starters, you should use izip from itertools. See below.
from itertools import izip

reportData = []
for row in csvReader:
    combinedDict = dict(izip(fields, row))
    combinedDict = cleanDict(combinedDict)  # cleanDict is probably where the bottleneck is
    reportData.append(combinedDict)
izip is a generator version of zip, so it has a lower memory impact, though you probably won't see much of a gain here since you're only zipping one short row at a time. I would take a hard look at your cleanDict() function instead: it has tons of if statements to evaluate, and that takes time.
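For example, one way to thin out that branching is a table-driven cleanup. This is only a sketch of mine, reusing the field names from the question, not a drop-in replacement:

def to_int(value):
    return int(float(value))

def to_fixed(value):
    return "{0:.10f}".format(float(value))

# One lookup table replaces the chain of if statements.
CLEANERS = {'a_id': to_int, 'i_id': to_int, 'pm': to_fixed, 's': to_fixed}

def cleanDict(combinedDict):
    for key, clean in CLEANERS.items():
        if combinedDict.get(key) is not None:
            combinedDict[key] = clean(combinedDict[key])
    # The derived unique_a_id / unique_i_id fields from the question
    # would still need to be added here.
    return combinedDict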
Lastly, if you are really pressed for more speed and can't figure out where to get it from, look into parallel processing:
from concurrent.futures import ProcessPoolExecutor
See https://docs.python.org/3/library/concurrent.futures.html for details.
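A minimal sketch of what that could look like here (assuming Python 3, where the linked module is in the standard library and zip is already lazy, and assuming cleanDict is defined at module level so worker processes can pickle it; the chunksize of 1000 is my arbitrary pick):

from concurrent.futures import ProcessPoolExecutor

rows = [dict(zip(fields, row)) for row in csvReader]
with ProcessPoolExecutor() as executor:
    # Each worker cleans a batch of rows; batching cuts pickling overhead.
    reportData = list(executor.map(cleanDict, rows, chunksize=1000))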
Also, please take a look at the PEP 8 guidelines for Python: https://www.python.org/dev/peps/pep-0008/. Your indentation is wrong; all indentation should be 4 spaces. If nothing else, it helps with readability.
Upvotes: 0