Balázs Fehér

Reputation: 344

Efficient list operations

I have a large matrix (1,017,209 rows) from which I need to read out elements, perform operations on them, and collect the results into lists. When I do it on 10,000 or even 100,000 rows it finishes in a reasonable time, but 1,000,000 does not. Here is my code:

import pandas as pd

data = pd.read_csv('scaled_train.csv', index_col=False, header=0)
new = data.as_matrix()

def vectorized_id(j):
    """Return a 1115-dimensional unit vector with a 1.0 in the j-1'th position
    and zeroes elsewhere.  This is used to convert the store ids (1...1115)
    into a corresponding desired input for the neural network.
    """
    j = j - 1
    e = [0] * 1115
    e[j] = 1.0
    return e

def vectorized_day(j):
    """Return a 7-dimensional unit vector with a 1.0 in the j-1'th position
    and zeroes elsewhere.  This is used to convert the days (1...7)
    into a corresponding desired input for the neural network.
    """
    j = j - 1
    e = [0] * 7
    e[j] = 1.0
    return e

list_b = []
list_a = []

for x in xrange(0,1017209):
    a1 = vectorized_id(new[x][0])
    a2 = vectorized_day(new[x][1])
    a3 = [new[x][5]]
    a = a1 + a2 + a3
    b = new[x][3]
    list_a.append(a)
    list_b.append(b)

What makes it slow at that scale (what is the bottleneck)? Are there ways to optimize it?

Upvotes: 1

Views: 65

Answers (1)

John Percival Hackworth

Reputation: 11531

A couple of things:

  1. Don't read in the entire file at once; you don't appear to be doing anything that requires having multiple rows in memory at the same time.
  2. Look at using csv.reader to load and process your data row by row (a sketch follows this list).
  3. Really, stop indexing into the giant new matrix over and over; pull each field out of the current row once.
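A minimal sketch of that streaming approach, assuming the CSV has the same column layout the loop indexes (column 0 = store id, column 1 = day, column 3 = the value collected into list_b, column 5 = the extra feature); the type conversions are guesses, so adjust them to the real contents of scaled_train.csv:

import csv

def vectorized_id(j):
    e = [0] * 1115
    e[j - 1] = 1.0
    return e

def vectorized_day(j):
    e = [0] * 7
    e[j - 1] = 1.0
    return e

list_a = []
list_b = []

with open('scaled_train.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # Read each needed field out of the row exactly once,
        # instead of re-indexing new[x] several times per iteration.
        store_id = int(float(row[0]))   # assumed: column 0 holds the store id
        day = int(float(row[1]))        # assumed: column 1 holds the day
        target = float(row[3])          # assumed: column 3 goes into list_b
        feature = float(row[5])         # assumed: column 5 is the extra feature

        list_a.append(vectorized_id(store_id) + vectorized_day(day) + [feature])
        list_b.append(target)

This keeps memory roughly constant regardless of file size and avoids building the intermediate pandas DataFrame and NumPy matrix entirely.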

Upvotes: 1
