Reputation: 104
I'm building an application that provides some very simple analysis on large datasets. These datasets are delivered in CSV files of 10 million+ rows with about 30 columns. (I don't need many of the columns.)
Logic tells me that loading the entire file into a DataFrame should make it faster to access. But my computer says no.
I've tried loading in batches, as well as loading the entire file and then performing the functions in batches.
But the end result is that it takes more than 10 times as long as a simple file read to perform the same process.
Here is the DataFrame version:
def runProcess():
    global batchSize
    batchCount = 10
    if rowLimit < 0:
        with open(df_srcString) as f:
            rowCount = sum(1 for line in f)
        if batchSize < 0:
            batchSize = batchSize * -1
            runProc = readFileDf
        else:
            runProc = readFileDfBatch
        batchCount = int(rowCount / batchSize) + 1
    else:
        batchCount = int(rowLimit / batchSize) + 1
    for i in range(batchCount):
        result = runProc(batchSize, i)
        print(result)
def readFileDfBatch(batch, batchNo):
    sCount = 0
    lCount = 0
    jobStartTime = datetime.datetime.now()
    eof = False
    totalRowCount = 0
    startRow = batch * batchNo
    df_wf = pd.read_csv(df_srcString, sep='|', header=None, names=df_fldHeads.split(','), usecols=df_cols, dtype=str, nrows=batch, skiprows=startRow)
    for index, row in df_wf.iterrows():
        result = parseDfRow(row)
        totalRowCount = totalRowCount + 1
        if result == 1:
            sCount = sCount + 1
        elif result == 2:
            lCount = lCount + 1
    eof = batch > len(df_wf)
    if rowLimit >= 0:
        eof = (batch * batchNo >= rowLimit)
    jobEndTime = datetime.datetime.now()
    runTime = jobEndTime - jobStartTime
    return [batchNo, sCount, lCount, totalRowCount, runTime]
def parseDfRow(row):
    # df_cols = ['ColumnA','ColumnB','ColumnC','ColumnD','ColumnE','ColumnF']
    status = 0
    s2 = getDate(row['ColumnB'])
    l2 = getDate(row['ColumnD'])
    gDate = datetime.date(1970, 1, 1)
    r1 = datetime.date(int(row['ColumnE'][1:5]), 12, 31)
    r2 = row['ColumnF']
    if len(r2) > 1:
        lastSeen = getLastDate(r2)
    else:
        lastSeen = r1
    status = False
    if s2 > lastSeen:
        status = 1
    elif l2 > lastSeen:
        status = 2
    return status
And here is the simple file reader version:
def readFileStd(rows, batch):
    print("Starting read: ")
    batchNo = 1
    global targetFile
    global totalCount
    global sCount
    global lCount
    targetFile = open(df_srcString, "r")
    eof = False
    while not eof:
        batchStartTime = datetime.datetime.now()
        eof = readBatch(batch)
        batchEndTime = datetime.datetime.now()
        runTime = batchEndTime - batchStartTime
        if rows > 0 and totalCount >= rows: break
        batchNo = batchNo + 1
    targetFile.close()
    return [batchNo, sCount, lCount, totalCount, runTime]
def readBatch(batch):
    global targetFile
    global totalCount
    rowNo = 1
    rowStr = targetFile.readline()
    while rowStr:
        parseRow(rowStr)
        totalCount = totalCount + 1
        if rowNo == batch:
            return False
        rowStr = targetFile.readline()
        rowNo = rowNo + 1
    return True
def parseRow(rowData):
    rd = rowData.split('|')
    s2 = getDate(rd[3])
    l2 = getDate(rd[5])
    gDate = datetime.date(1970, 1, 1)
    r1 = datetime.date(int(rd[23][1:5]), 12, 31)
    r2 = rd[24]
    if len(r2) > 1:
        lastSeen = getLastDate(r2)
    else:
        lastSeen = r1
    status = False
    if s2 > lastSeen:
        global sCount
        sCount = sCount + 1
        status = True
        gDate = s2
    elif l2 > lastSeen:
        global lCount
        lCount = lCount + 1
        gDate = s2
Am I doing something wrong?
Upvotes: 1
Views: 2479
Reputation: 16184
a few comments about your code:

- global variables are scaring me! what's wrong with passing parameters and returning state?
- you're not using Pandas for anything it's good at: creating a dataframe just to do a dumb iteration over rows causes it to do lots of unnecessary work
- the csv module (which can be used with delimiter='|') provides a much closer interface if row-by-row processing really is the best way you can do this (see the sketch after this list)
- this might be a better question for https://codereview.stackexchange.com/
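for reference, here's a rough sketch of what the csv-module route might look like. it is not tested against your data: the column indices mirror parseRow from the question, and getDate / getLastDate are your own helpers, assumed to be in scope:

import csv
import datetime

def countStatuses(path):
    # same counts as sCount / lCount / totalCount in the question;
    # getDate and getLastDate are the question's own helpers
    sCount = 0
    lCount = 0
    totalCount = 0
    with open(path, newline='') as f:
        for rd in csv.reader(f, delimiter='|'):
            totalCount += 1
            s2 = getDate(rd[3])
            l2 = getDate(rd[5])
            r1 = datetime.date(int(rd[23][1:5]), 12, 31)
            lastSeen = getLastDate(rd[24]) if len(rd[24]) > 1 else r1
            if s2 > lastSeen:
                sCount += 1
            elif l2 > lastSeen:
                lCount += 1
    return [sCount, lCount, totalCount]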
just playing with the performance of alternative ways of working row-wise. the take-home from the below seems to be that working "row-wise" is basically always slow with Pandas.
start by creating a dataframe to test this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)
this takes 3.65 ms to create a dataframe with int and str columns. next I try the iterrows approach:
tot = 0
for i, row in df.iterrows():
    tot += row[0] / 1e5 < len(row[1])
the aggregation is pretty dumb, I just wanted something that uses both columns. it takes a scarily long 903 ms. next I try iterating manually:
tot = 0
for i in range(df.shape[0]):
    tot += df.loc[i, 0] / 1e5 < len(df.loc[i, 1])
which reduces this down to 408 ms. next I try apply:
def fn(row):
    return row[0] / 1e5 < len(row[1])
sum(df.apply(fn, axis=1))
which is basically the same at 368 ms. finally, I find some code that Pandas is happy with:
sum(df[0] / 1e5 < df[1].apply(len))
which takes 4.15 ms. and another approach that occurred to me:
tot = 0
for a, b in zip(df[0], df[1]):
    tot += a / 1e5 < len(b)
which takes 2.78 ms. while another variant:
tot = 0
for a, b in zip(df[0] / 1e5, df[1]):
    tot += a < len(b)
takes 2.29 ms.
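if you want to reproduce these kinds of measurements yourself, a minimal harness along these lines (just a sketch using the standard-library timeit; the setup mirrors the dataframe above, and any of the variants can be dropped into the timed function) should give comparable numbers:

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1e6, (10_000, 2)))
df[1] = df[1].apply(str)

def zipped(df):
    # the fastest variant from above: plain zip over the two columns
    tot = 0
    for a, b in zip(df[0] / 1e5, df[1]):
        tot += a < len(b)
    return tot

# average seconds per call over 100 runs
print(timeit.timeit(lambda: zipped(df), number=100) / 100)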
Upvotes: 1
Reputation: 4265
iterrows doesn't take advantage of vectorized operations. Most of the benefits of using pandas come from vectorized and parallel operations.
Replace for index, row in df_wf.iterrows(): with df_wf.apply(something, axis=1), where something is a function that encapsulates the logic you needed from iterrows and uses numpy vectorized operations.
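In your case that could look roughly like the following (a sketch only: it reuses parseDfRow from the question unchanged, so the per-row Python work is still there, but the counting moves into pandas):

# instead of iterating and counting manually:
results = df_wf.apply(parseDfRow, axis=1)   # Series of 0/1/2, one value per row
sCount = int((results == 1).sum())
lCount = int((results == 2).sum())
totalRowCount = len(df_wf)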
Also, if your df doesn't fit in memory such that you need to read it in batches, consider using dask or spark over pandas.
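With dask, for example, the chunking is handled for you. A rough sketch, reusing df_srcString, df_fldHeads, df_cols and parseDfRow from the question and assuming they are in scope:

import dask.dataframe as dd

# dask splits the file into partitions of roughly blocksize bytes,
# so there is no manual batching / skiprows bookkeeping
ddf = dd.read_csv(df_srcString, sep='|', header=None,
                  names=df_fldHeads.split(','), usecols=df_cols,
                  dtype=str, blocksize='64MB')

# run the question's parseDfRow per partition; meta tells dask the result type
statuses = ddf.map_partitions(lambda pdf: pdf.apply(parseDfRow, axis=1),
                              meta=('status', 'int64'))
print((statuses == 1).sum().compute(), (statuses == 2).sum().compute())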
Further reading: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
Upvotes: 1