Reputation: 631
I am trying to iterate over 5 million records in a CSV. I am stuck with the following loop.
trajectory = 0
for index, row in df.iterrows():
    if row['trajectory'] == 'NaN':
        trajectory = trajectory + 1
        df.loc[index, 'classification'] = trajectory
    else:
        df.loc[index, 'classification'] = trajectory
When I come across 'NaN' in my DataFrame, I increase the value of trajectory and write it into my 'classification' column.
This worked on a smaller dataset, but when I run the code on my full 0.5 GB CSV it takes hours.
Upvotes: 2
Views: 599
Reputation: 863156
If 'NaN' is a string, compare against it and use cumsum:
df['classification'] = (df['trajectory'] == 'NaN').cumsum() + trajectory
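For illustration, a minimal sketch on made-up data (the column values here are hypothetical) showing how the boolean mask plus cumsum numbers the segments:

import pandas as pd

df = pd.DataFrame({'trajectory': ['a', 'NaN', 'b', 'NaN', 'c']})
trajectory = 0

# The comparison is True where the literal string 'NaN' appears;
# cumsum() turns each True into a +1 step, so every row is numbered
# by how many 'NaN' markers precede it (inclusive).
df['classification'] = (df['trajectory'] == 'NaN').cumsum() + trajectory
print(df)
#   trajectory  classification
# 0          a               0
# 1        NaN               1
# 2          b               1
# 3        NaN               2
# 4          c               2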
Or, if NaN is a real missing value, compare with isnull:
df['classification'] = df['trajectory'].isnull().cumsum() + trajectory
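The same idea with real missing values, again on made-up data; isnull() flags actual NaN entries rather than the string 'NaN':

import numpy as np
import pandas as pd

df = pd.DataFrame({'trajectory': ['a', np.nan, 'b', np.nan, 'c']})
trajectory = 0

# isnull() is True for missing values; cumsum() numbers the runs.
df['classification'] = df['trajectory'].isnull().cumsum() + trajectory
print(df)
#   trajectory  classification
# 0          a               0
# 1        NaN               1
# 2          b               1
# 3        NaN               2
# 4          c               2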
Timings:
import numpy as np
import pandas as pd

np.random.seed(2017)
L = ['s', 'a', 'NaN']
N = 1000
df = pd.DataFrame({'trajectory': np.random.choice(L, size=N)})
# print(df)
trajectory = 0
def new(df, trajectory):
    df['classification'] = (df['trajectory'] == 'NaN').cumsum() + trajectory
    return df
def old(df, trajectory):
    for index, row in df.iterrows():
        if row['trajectory'] == 'NaN':
            trajectory = trajectory + 1
            df.loc[index, 'classification'] = trajectory
        else:
            df.loc[index, 'classification'] = trajectory
    return df
In [74]: %timeit (old(df, trajectory))
1 loop, best of 3: 609 ms per loop
In [75]: %timeit (new(df, trajectory))
1000 loops, best of 3: 928 µs per loop
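The gap should only widen with size: old() pays Python-level overhead for every iterrows() row and every .loc assignment, while new() processes the whole column in one vectorized pass, so on the full 5-million-row file the speedup will be far larger than in this 1000-row benchmark.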
Upvotes: 4