user1689987

Reputation: 1546

fastest way in python to read csv, process each line, and write a new csv

What is the fastest way to process each line of a CSV and write it out to a new CSV? Is there a way to use the least memory while also being the fastest? Please see the following code. It requests a CSV from an API, but it takes a very long time to go through the for loop I commented. I also think it is using all the memory on my server.

from pandas import *
import csv
import requests

reportResult = requests.get(api,headers=header)
csvReader = csv.reader(utf_8_encoder(reportResult.text))
reportData = []
#for loop takes a long time
for row in csvReader:
  combinedDict  = dict(zip(fields, row))
  combinedDict = cleanDict(combinedDict)
  reportData.append(combinedDict)
reportDF = DataFrame(reportData, columns = fields)
reportDF.to_csv('report.csv',sep=',',header=False,index=False)



def utf_8_encoder(unicode_csv_data):
  for line in unicode_csv_data:
    yield line.encode('utf-8')



def cleanDict(combinedDict):
  if combinedDict.get('a_id', None) is not None:
    combinedDict['a_id'] = int(
        float(combinedDict['a_id']))
    combinedDict['unique_a_id'] = ('1_a_' +
        str(combinedDict['a_id']))
  if combinedDict.get('i_id', None) is not None:
    combinedDict['i_id'] = int(
        float(combinedDict['i_id']))
    combinedDict['unique_i_id'] = ('1_i_' +
        str(combinedDict['i_id']))
  if combinedDict.get('pm', None) is not None:
    combinedDict['pm'] = "{0:.10f}".format(float(combinedDict['pm']))
  if combinedDict.get('s', None) is not None:
    combinedDict['s'] = "{0:.10f}".format(float(combinedDict['s']))
  return combinedDict

When I run the Python memory profiler, why does the line with the for loop show a memory increment? Is the for loop itself saving something in memory, or is my UTF-8 converter messing something up?

Line #    Mem usage    Increment   Line Contents
================================================
   162 1869.254 MiB 1205.824 MiB     for row in csvReader:
   163                                 #print row
   164 1869.254 MiB    0.000 MiB       combinedDict  = dict(zip(fields, row))

When I put the @profile decorator on the utf_8_encoder function as well, I see the memory on the above for loop disappeared:

   163                               for row in csvReader:

But now there is memory attributed to the converter's for loop (I didn't let it run as long as last time, so it only got to 56 MB before I hit Ctrl+C):

Line #    Mem usage    Increment   Line Contents
================================================
   154  663.430 MiB    0.000 MiB   @profile
   155                             def utf_8_encoder(unicode_csv_data):
   156  722.496 MiB   59.066 MiB     for line in unicode_csv_data:
   157  722.496 MiB    0.000 MiB       yield line.encode('utf-8')

Upvotes: 4

Views: 9064

Answers (2)

user1689987

Reputation: 1546

I found that using a DataFrame to read the CSV is much faster and doesn't use so much memory that my server crashes:

from cStringIO import StringIO
from pandas import *

# reportResult is the requests response from the question above
reportText = StringIO(reportResult.text)
reportDF = read_csv(reportText, sep=',', parse_dates=False)

Then I am able to process it using apply, for example:

def trimFloat(fl):
    if fl is not None:
      res = "{0:.10f}".format(float(fl))
      return res
    else:
      return None

floatCols  = ['a', 'b ']
for col in floatCols:
    reportDF[col] = reportDF[col].apply(trimFloat)


def removePct(reportDF):
  reportDF['c'] = reportDF['c'].apply(lambda x: x.translate(None, '%'))
  return reportDF
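
For completeness, once the columns are cleaned the DataFrame can be written straight back out with to_csv, the same call as in the original question (the file name and options are just carried over from there):

reportDF = removePct(reportDF)
reportDF.to_csv('report.csv', sep=',', header=False, index=False)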

I suspect the major issue with the previous attempt had something to do with the UTF-8 encoder.

Upvotes: 2

reticentroot

Reputation: 3682

For starters, you should use izip from itertools. See below.

from itertools import izip

reportData = []
for row in csvReader:
    combinedDict = dict(izip(fields, row))
    combinedDict = cleanDict(combinedDict)  # cleanDict() is probably where the bottleneck is
    reportData.append(combinedDict)

izip is a generator version of zip, so it has a lower memory impact. You probably won't see much of a gain, though, since it looks like you're zipping one row at a time. I would take a look at your cleanDict() function; it has a lot of if statements to evaluate, and that takes time. Lastly, if you are really pressed for more speed and can't figure out where to get it from, consider using

from concurrent.futures import ProcessPoolExecutor

or, in other words, take a look at parallel processing: https://docs.python.org/3/library/concurrent.futures.html
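
A minimal sketch of that idea, assuming cleanDict and fields from the question are defined at module level (so worker processes can see them), and batching the rows so each process gets a decent-sized chunk (the batch size of 10000 is just a guess):

from concurrent.futures import ProcessPoolExecutor

def clean_rows(rows):
    # each worker turns its batch of rows into cleaned dicts
    return [cleanDict(dict(zip(fields, row))) for row in rows]

def batched(iterable, size=10000):
    # group the row iterator into lists the executor can pickle and ship to workers
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

reportData = []
with ProcessPoolExecutor() as executor:
    for cleaned in executor.map(clean_rows, batched(csvReader)):
        reportData.extend(cleaned)

Whether this actually wins depends on how expensive cleanDict is compared to the cost of shipping the row batches to the worker processes.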

Also, please take a look at the PEP 8 guidelines for Python: https://www.python.org/dev/peps/pep-0008/. Your indentation is inconsistent; all indentation should be 4 spaces. If nothing else, it helps with readability.

Upvotes: 0
