Reputation: 303
I have a CSV file that is roughly 28,000 rows x 785 columns. I need to 1) separate out the column header, 2) put the first column of every row into a labels array, and 3) turn the remaining 784 columns of each row into a 28x28 matrix and append them to my images array after converting their values to floats.
Is there a faster way to iterate through my CSV?
import csv
import numpy as np

images = np.array([])
labels = np.array([])
with open(filename) as training_file:
    reader = csv.reader(training_file, delimiter=',')
    header = np.array(next(reader))
    for row in reader:
        label = row[0]                                # get each row's label
        pixels = row[1:785]                           # get pixel values of each row
        pixels = np.array(pixels).astype(float)       # convert pixel values to floats
        pixels = pixels.reshape(28, 28)               # turn into 28x28 matrix
        labels = np.append(labels, np.array(label))   # append to labels array
        images = np.append(images, np.array(pixels))  # append to images array
Upvotes: 0
Views: 2772
Reputation: 303
As suggested by a few folks: read the rows once, preallocate the arrays, and fill them in place. (Note that len(list(reader)) exhausts the reader, so the rows have to be kept around for the loop.)
with open(filename) as training_file:
    reader = csv.reader(training_file, delimiter=',')
    header = np.array(next(reader))   # column headers
    rows = list(reader)               # materialize: the reader can only be consumed once

images = np.empty((len(rows), 784))   # preallocated array
labels = np.empty((len(rows),))       # preallocated array
for i, row in enumerate(rows):
    labels[i] = float(row[0])         # get each row's label
    images[i] = row[1:785]            # get pixel values of each row
images = images.reshape(-1, 28, 28)
Upvotes: 0
Reputation: 3328
I think creating arrays is expensive. Appending to an array re-creates it behind the scenes, which is also expensive. You could allocate all the memory at once, like:
x = np.empty((28000,784))
then save each CSV row into each row of the array. Updating an array in place is extremely fast and highly optimized. When you are done, you can change the shape: x.shape = (28000, 28, 28). Note that array shape and memory allocation are disconnected in numpy, so reshaping an array costs nothing (it simply updates how the values are accessed; it doesn't move the values around). This means there is no reason to reshape each individual row before storing it in the array.
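A minimal sketch of this approach, assuming exactly 28,000 data rows and the filename variable from the question:
import csv
import numpy as np

with open(filename) as training_file:
    reader = csv.reader(training_file, delimiter=',')
    next(reader)                   # skip the column headers
    x = np.empty((28000, 784))     # allocate all the memory at once
    labels = np.empty((28000,))
    for i, row in enumerate(reader):
        labels[i] = float(row[0])  # first column is the label
        x[i] = row[1:785]          # update the row in place
x.shape = (28000, 28, 28)          # free: no values are moved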
Upvotes: 1
Reputation: 96277
The iteration takes almost no time. The problem is that you are using a highly inefficient approach to create your arrays.
Never do this in a loop with numpy.ndarray objects:
labels = np.append(labels, np.array(label)) # append to labels array
images = np.append(images, np.array(pixels)) # append to images array
Instead, make labels and images lists:
labels = []
images = []
And then in your loop, append to the list objects (a highly efficient operation):
labels.append(np.array(label)) # append to labels list
images.append(np.array(pixels)) # append to images list
Then finally, after your loop is done, convert the list of arrays to an array:
labels = np.array(labels)
images = np.array(images)
Note, I'm not sure what shape you are expecting for the final arrays; you may need to reshape the result. Your approach would flatten the final array with each .append, because you do not specify an axis... if that's truly what you want, then labels.ravel() would get you that in the end.
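Put together, the whole loop might look like this (a sketch, reusing the filename and the 28x28 layout from the question):
import csv
import numpy as np

labels = []
images = []
with open(filename) as training_file:
    reader = csv.reader(training_file, delimiter=',')
    header = np.array(next(reader))
    for row in reader:
        labels.append(float(row[0]))                      # cheap list append
        images.append(np.array(row[1:785], dtype=float))  # one small array per row
labels = np.array(labels)                                 # single conversion at the end
images = np.array(images).reshape(-1, 28, 28)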
Upvotes: 0
Reputation: 978
You could use pandas to read your CSV file.
import pandas as pd
csv_file = pd.read_csv('file.csv')
The columns can then be accessed by name, e.g. csv_file.name.
Depending on the data size, you can read your file by chunks:
import pandas as pd
csv_file = pd.read_csv('file.csv', chunksize=1)
In any case, read the pandas documentation; I believe this is the best way to go.
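For the task in the question, a minimal sketch could look like this (assuming the labels sit in the first of the 785 columns, as above):
import pandas as pd

csv_file = pd.read_csv('file.csv')                  # the header row is parsed automatically
labels = csv_file.iloc[:, 0].to_numpy(dtype=float)  # first column holds the labels
images = csv_file.iloc[:, 1:].to_numpy(dtype=float).reshape(-1, 28, 28)  # remaining 784 columns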
Upvotes: 1