Tom M

Reputation: 1322

How do I store a numpy array as an object in a pandas dataframe?

I have a series of images stored in a CSV file as one string per image. Each string is a list of 9216 space-separated integers. I have a function that converts such a string to a 96x96 numpy array.

I wish to store this numpy array in a column of my dataframe instead of the string.

However, when I retrieve the item from the column, it is no longer usable as a numpy array.

The data can be downloaded from here; the images are in the last column of the training.csv file.

https://www.kaggle.com/c/facial-keypoints-detection/data

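import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

df_train = pd.read_csv("training.csv")

def convert_to_np_arr(im_as_str):
    # Parse the space-separated pixel string into a list of ints
    im = [int(i) for i in im_as_str.split()]
    im = np.asarray(im)
    # 9216 values -> 96x96 image
    im = im.reshape((96, 96))
    return im

# Store one 96x96 array per row in a new column
df_train['Im_as_np'] = df_train.Image.apply(convert_to_np_arr)

im = df_train.Im_as_np[0]
plt.imshow(im, cmap=cm.Greys_r)
plt.show()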

If, instead of applying the function and storing the result in the dataframe, I run the same conversion directly on a single image, it works as expected:

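import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

df_train = pd.read_csv("training.csv")

# Convert only the first image, without storing it in the dataframe
im = df_train.Image[0]
im = [int(i) for i in im.split()]
im = np.asarray(im)
im = im.reshape((96, 96))

plt.imshow(im, cmap=cm.Greys_r)
plt.show()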

Upvotes: 5

Views: 4484

Answers (2)

ely

Reputation: 77404

Pandas is generally not a suitable data structure for handling images. Pandas usually assumes that the number of columns is much smaller than the number of rows. That doesn't have to be true, and for DataFrames that are small in both dimensions it rarely matters. But for mathematical operations that are natural in a spatial sense, the relational structure of a DataFrame is not appropriate, and this shows as the number of columns grows. Given this, I would suggest using NumPy's CSV-reading abilities and working with the data as a 2D array or as an image object, e.g. with scikit-image.
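As a rough illustration of the "work with it as an array" suggestion, here is a minimal sketch that parses the pixel strings into a single 3D NumPy array instead of a dataframe column. The synthetic `image_strings` list, the `parse_image` helper, and the `side` parameter are assumptions made for the example; they stand in for the real `Image` column and are not part of the original answer.

```python
import numpy as np

# Synthetic stand-in for the "Image" column: each entry is one string of
# 96 * 96 space-separated pixel values (hypothetical data, not the Kaggle file).
rng = np.random.default_rng(0)
image_strings = [
    " ".join(str(v) for v in rng.integers(0, 256, size=96 * 96))
    for _ in range(3)
]

def parse_image(im_as_str, side=96):
    """Turn one space-separated pixel string into a (side, side) uint8 array."""
    pixels = np.array([int(v) for v in im_as_str.split()], dtype=np.uint8)
    return pixels.reshape(side, side)

# Stack all images into one array of shape (n_images, 96, 96).
images = np.stack([parse_image(s) for s in image_strings])
print(images.shape)  # (3, 96, 96)

# Each image is then plain integer indexing away.
first = images[0]
```

With this layout, `images[i]` is an ordinary 2D array that can be passed to `plt.imshow` or to an image library, and whole-dataset operations (normalization, cropping) become single array expressions.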

Upvotes: 3

Happy001

Reputation: 6383

The way you store it should be correct. It's just harder to access the data. Instead of im=df_train.Im_as_np[0], use ix to access the data:

im=df_train.ix[0,'Im_as_np']
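Note that the `.ix` indexer was deprecated and then removed in pandas 1.0, so the line above fails on current versions. The sketch below shows the same idea with `.loc`, `.at`, and `.iloc`, which are the label-based and position-based accessors available today. It uses a tiny made-up dataframe with 2x2 "images" (the `side` parameter is an assumption added for the example) so it can run without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for training.csv: 2x2 "images" as pixel strings.
df = pd.DataFrame({"Image": ["0 1 2 3", "4 5 6 7"]})

def convert_to_np_arr(im_as_str, side=2):
    """Same conversion as in the question, with the side length as a parameter."""
    return np.array([int(i) for i in im_as_str.split()]).reshape(side, side)

# Each cell of the new column holds one NumPy array (object dtype column).
df["Im_as_np"] = df["Image"].apply(convert_to_np_arr)

# Retrieve a single array by row label and column name.
im_loc = df.loc[0, "Im_as_np"]
im_at = df.at[0, "Im_as_np"]

# Or by position within the column.
im_iloc = df["Im_as_np"].iloc[0]

print(type(im_loc), im_loc.shape)
```

All three accessors should hand back the stored array itself, so the result can be passed straight to `plt.imshow`.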

Upvotes: 1