Reputation: 344
I have a large matrix (1,017,209 rows) from which I need to read out elements, perform operations on them, and collect the results into lists. When I run it on 10,000 or even 100,000 rows it finishes in a reasonable time, but 1,000,000 does not. Here is my code:
import pandas as pd
data = pd.read_csv('scaled_train.csv', index_col=False, header=0)
new = data.as_matrix()
def vectorized_id(j):
    """Return a 1115-dimensional unit vector with a 1.0 in the (j-1)th position
    and zeroes elsewhere. This is used to convert the store ids (1...1115)
    into a corresponding desired input for the neural network.
    """
    j = j - 1
    e = [0] * 1115
    e[j] = 1.0
    return e

def vectorized_day(j):
    """Return a 7-dimensional unit vector with a 1.0 in the (j-1)th position
    and zeroes elsewhere. This is used to convert the days (1...7)
    into a corresponding desired input for the neural network.
    """
    j = j - 1
    e = [0] * 7
    e[j] = 1.0
    return e

list_b = []
list_a = []

for x in xrange(0, 1017209):
    a1 = vectorized_id(new[x][0])
    a2 = vectorized_day(new[x][1])
    a3 = [new[x][5]]
    a = a1 + a2 + a3
    b = new[x][3]
    list_a.append(a)
    list_b.append(b)
What makes it slow at that scale (what is the bottleneck)? Are there ways to optimize it?
Upvotes: 1
Views: 65
Reputation: 11531
A couple of things:
- Use csv.reader for loading your data instead of going through pandas, since you immediately convert the DataFrame into a plain matrix anyway (a rough sketch follows below).
- Indexing into new element by element in a pure-Python loop, and appending a fresh 1,000+-element Python list for every row, is expensive at a million rows; build each row's encoding directly in the loop rather than through per-row helper calls and list concatenation.
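A minimal sketch of that, assuming the column layout implied by the question (store id in column 0, day in column 1, target in column 3, extra feature in column 5); everything beyond those assumptions is illustrative:

import csv

list_a = []
list_b = []

with open('scaled_train.csv', 'rb') as f:   # on Python 3 use open('scaled_train.csv', newline='')
    reader = csv.reader(f)
    next(reader)                            # skip the header row
    for row in reader:
        store_id = int(float(row[0]))       # assumed column positions, as in the question
        day = int(float(row[1]))
        # build the whole 1115 + 7 + 1 encoding in one list instead of
        # creating and concatenating three lists per row
        a = [0.0] * (1115 + 7 + 1)
        a[store_id - 1] = 1.0               # one-hot store id
        a[1115 + day - 1] = 1.0             # one-hot day
        a[-1] = float(row[5])               # extra feature
        list_a.append(a)
        list_b.append(float(row[3]))

Note that list_a will still hold roughly a million rows of over a thousand Python objects each, which is several gigabytes; if memory pressure turns out to be the real bottleneck, storing the encodings in a NumPy array (or a scipy.sparse matrix, since the rows are almost entirely zeros) is much cheaper than nested lists.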
Upvotes: 1