BorangeOrange1337

Reputation: 303

Faster Way To Iterate Through Rows of a CSV?

I have a CSV file that is roughly 28,000 rows x 785 columns. I need to 1) separate out the column header, 2) put the first column of every row into a labels array, and 3) turn the remaining 784 columns of each row into a 28x28 matrix and append it to my images array after converting its values to floats.

Is there a faster way to iterate through my CSV?

    import csv
    import numpy as np

    images = np.array([])
    labels = np.array([])

    with open(filename) as training_file:
        reader = csv.reader(training_file, delimiter=',')
        header = np.array(next(reader))

        for row in reader:
            label = row[0] # get each row's label

            pixels = row[1:785] # get pixel values of each row
            pixels = np.array(pixels).astype(float) # transform pixel values to floats
            pixels = pixels.reshape(28,28) # turn into 28x28 matrix

            labels = np.append(labels, np.array(label)) # append to labels array
            images = np.append(images, np.array(pixels)) # append to images array

Upvotes: 0

Views: 2772

Answers (4)

BorangeOrange1337

Reputation: 303

As suggested by a few folks:

  • It's computationally expensive to re-create arrays and constantly append to them. Instead, I now create empty arrays from the get-go and write into them by index. That made what was already a relatively quick computation that much faster.
    with open(filename) as training_file:
        reader = csv.reader(training_file, delimiter=',')
        header = np.array(next(reader)) # column headers

        rows = list(reader) # materialize once; the reader is exhausted after this
        row_count = len(rows)

        images = np.empty((row_count, 784)) # preallocated array
        labels = np.empty((row_count,)) # preallocated array

        for i, row in enumerate(rows):
            labels[i] = float(row[0]) # store each row's label
            images[i] = np.asarray(row[1:785], dtype=float) # store pixel values of each row

    images = images.reshape(-1, 28, 28) # reshape is free; values are already floats

Upvotes: 0

Bobby Ocean

Reputation: 3328

I think creating arrays is expensive. Appending to arrays re-creates them in the background, which is also expensive. You could allocate all the memory at once, like:

x = np.empty((28000,784))

then save each row to each row of the array. Updating an array in place is extremely fast and highly optimized. When you are done, you can change the shape: x.shape = (28000, 28, 28). Note that array shape and memory allocation are disconnected in numpy, hence reshaping an array costs nothing (it simply updates how the values are accessed; it doesn't move them around). This means there is no reason to reshape each individual row before assigning it to the array. A sketch of this idea follows.
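For illustration, a minimal sketch of that preallocation approach, assuming the 28,000-row count and the csv-reader setup from the question (the filename is hypothetical):

    import csv
    import numpy as np

    with open('train.csv') as f:  # hypothetical filename
        reader = csv.reader(f)
        next(reader)  # skip the header row

        x = np.empty((28000, 784))  # allocate all the memory once
        labels = np.empty(28000)

        for i, row in enumerate(reader):
            labels[i] = float(row[0])
            x[i] = np.asarray(row[1:], dtype=float)  # write into the preallocated row

    x.shape = (28000, 28, 28)  # free: only the access pattern changes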

Upvotes: 1

juanpa.arrivillaga

Reputation: 96277

The iteration takes almost no time. The problem is that you are using a highly inefficient approach to create your arrays.

Never do this in a loop with numpy.ndarray objects:

labels = np.append(labels, np.array(label)) # append to labels array
images = np.append(images, np.array(pixels)) # append to images array

Instead, make labels and images lists:

labels = []
images = []

And then in your loop, append to the list objects (a highly efficient operation):

labels.append(np.array(label)) # append to labels list
images.append(np.array(pixels)) # append to images list

Then finally, after your loop is done, convert the list of arrays to an array:

labels = np.array(labels)
images = np.array(images)

Note: I'm not sure what shape you expect the final arrays to have; you may need to reshape the result. Your original approach flattens the final array with each .append because you do not specify an axis... if that's truly what you want, then labels.ravel() would get you there in the end. A full sketch of the list-based approach follows.
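Putting the pieces together, a minimal sketch of the whole loop might look like this (the filename and the 785-column layout are assumptions carried over from the question):

    import csv
    import numpy as np

    labels = []
    images = []

    with open('train.csv') as training_file:  # hypothetical filename
        reader = csv.reader(training_file, delimiter=',')
        header = next(reader)  # column headers

        for row in reader:
            labels.append(float(row[0]))  # appending to a list is cheap
            images.append(np.array(row[1:785], dtype=float).reshape(28, 28))

    labels = np.array(labels)  # one conversion at the end
    images = np.array(images)  # shape: (n_rows, 28, 28)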

Upvotes: 0

Angelo Mendes

Reputation: 978

You could use pandas to read your CSV file.

import pandas as pd
csv_file = pd.read_csv('file.csv')

Columns are accessed by name, e.g. csv_file['column_name'] (or csv_file.column_name).

Depending on the data size, you can also read your file in chunks:

import pandas as pd
csv_file = pd.read_csv('file.csv', chunksize=1)

In any case, read the pandas documentation; I believe it is the best way to go. A sketch applied to the question's data follows.
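Applied to the question's data, a minimal sketch might look like this (the filename and the label-in-first-column layout are assumptions taken from the question):

    import pandas as pd

    csv_file = pd.read_csv('train.csv')  # hypothetical filename

    labels = csv_file.iloc[:, 0].to_numpy(dtype=float)   # first column: labels
    images = csv_file.iloc[:, 1:].to_numpy(dtype=float).reshape(-1, 28, 28)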

Upvotes: 1
