Reputation: 3718

"Slicing" a pair of numpy arrays based on values in one of them

Conceptually, I have two lists of equal length, one containing labels and the other data. So I asked an earlier question about lists, not realising that what I really had was two numpy arrays, not two lists.

What I do have is a folder containing images such as cat_01.jpg, cat_02.jpg, dog_01.jpg, dog_02.jpg, dog_03.jpg, fish_01.jpg, ..., tiger_03.jpg, zebra_01.jpg and zebra_02.jpg. I also have a successful program to read them in, parse a portion of each file name into a labels array, and the corresponding image data into my data array, so that I end up with something like:

>>> labels
array(['cat', 'cat', 'dog',  ..., 'tiger', 'zebra', 'zebra' ])
>>> type( data )
<class 'numpy.ndarray'>
>>> data[0][0][0]
array([78, 88, 98])

That makes sense: for each sample, data[sample][row][column] is the (R,G,B) value of the pixel at (column, row).

I want to specify a search label such as 'dog', and (conceptually) use it to generate two "sub-lists": the first containing all the (identical) matching labels from labels, and the second containing the associated image data from data. Rather than lists, though, I need to retain the original data format, in this case numpy arrays (if there is a more general, data-insensitive approach, I'd love to know about it). How can I do this?

Update: here's some specific test code to recreate the situation I am confronting, and with a sketch of a solution based on Stephen Rauch's answer:

import os, glob
from PIL import Image
import numpy as np
import pandas as pd    # not critical to question

def load_image(file):
  data = np.asarray(Image.open(file),dtype="float")
  return data

MasterClass = ['cat','dog','fsh','grf','hrs','leo','owl','pig','tgr','zbr']
os.chdir('data\\animals')
filelist = glob.glob("*.jpg")

full_labels = np.array([MasterClass.index(os.path.basename(fname)[:3]) for fname in filelist])
full_images = np.array([load_image(fname) for fname in filelist])
# The following lines sketch a solution, but one that leads to incompatible data types.
# That is, test_images and/or test_labels differ from full_images and full_labels
# in the data types involved.
df = pd.DataFrame(dict(label=list(full_labels),data=list(full_images)))
criteria = df['label'] == MasterClass.index('dog')
test_labels = np.array(df[criteria]['label'])
test_images = np.array(df[criteria]['data'])

Two notes:

When I originally wrote that there were file names "such as" tiger_03.jpg, I was simplifying reality. In truth, the code above expects file names like tgr03.jpg, and the list of labels I end up working with is not ['cat', 'cat', 'dog', ...] but a list of indices into the MasterClass list, that is, [0, 0, 1, ...].
For test purposes the contents of the files don't actually matter, so long as they are valid (JPEG) images. You can easily test with a handful of (identical) files in a folder with a handful of different names.

The question is: how do I get test_labels and test_images into a format identical to the original full_labels and full_images, but based on a selection criterion like the one sketched above? As it stands, this code does not achieve that level of data compatibility; it is not a strict "slice" function.

Upvotes: 0

Answers (3)

omatai

Reputation: 3718

Based on Stephen Rauch's answer to my earlier simpler question, it is possible to solve this as follows:

# assume full_labels and full_images exist as per test code in updated question
tuples = (x for x in zip(list(full_labels),list(full_images)) if x[0] == MasterClass.index('dog'))
xlabels,ximages = map(list, zip(*tuples))
test_labels = np.array(xlabels)
test_images = np.array(ximages)
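
One caveat with unpacking zip(*tuples): when no label matches, there is nothing to unpack and Python raises a ValueError. A minimal sketch of a guard follows; select_pairs is a hypothetical helper name, and the small synthetic arrays only stand in for full_labels and full_images.

```python
import numpy as np

def select_pairs(labels, images, wanted):
    """Hypothetical helper: zip-based selection that tolerates zero matches."""
    pairs = [(lab, img) for lab, img in zip(labels, images) if lab == wanted]
    if not pairs:
        # Empty slices keep the dtype and trailing shape of the inputs.
        return labels[:0], images[:0]
    xlabels, ximages = zip(*pairs)
    return np.array(xlabels), np.array(ximages)

# Synthetic stand-ins (assumption): 5 samples of 2x2 RGB "images".
full_labels = np.array([0, 0, 1, 1, 2])
full_images = np.arange(5 * 2 * 2 * 3, dtype=float).reshape(5, 2, 2, 3)

dog_labels, dog_images = select_pairs(full_labels, full_images, 1)
none_labels, none_images = select_pairs(full_labels, full_images, 9)

print(dog_images.shape)   # (2, 2, 2, 3)
print(none_images.shape)  # (0, 2, 2, 3)
```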

Upvotes: 0

Stephen Rauch

Reputation: 49784

If you can use pandas, it is VERY good at this sort of thing.

Code:

If you already have a dataframe, you can simply do:

# build a logical condition
have_dog = df['animal_label'] == 'dog'

# select the data when that condition is true
print(df[have_dog])

Test Code:

import pandas as pd
import numpy as np

animal_label = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]
data = [np.array((x,) * 3) for x in data]

df = pd.DataFrame(dict(animal_label=animal_label, data=data))
print(df)

have_dog = df['animal_label'] == 'dog'
print(df[have_dog])

Results:

  animal_label             data
0          cat  [0.3, 0.3, 0.3]
1          cat  [0.1, 0.1, 0.1]
2          dog  [0.9, 0.9, 0.9]
3          dog  [0.5, 0.5, 0.5]
4          dog  [0.4, 0.4, 0.4]
5         fish  [0.3, 0.3, 0.3]
6         fish  [0.2, 0.2, 0.2]
7      giraffe  [0.8, 0.8, 0.8]

  animal_label             data
2          dog  [0.9, 0.9, 0.9]
3          dog  [0.5, 0.5, 0.5]
4          dog  [0.4, 0.4, 0.4]
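
If the selection then has to go back to plain numpy arrays, one possible sketch is to stack the selected cells with np.stack rather than calling np.array on the column, since the latter yields a 1-D object array whose elements are arrays. This assumes every entry in the data column has the same shape; the names selected, dog_labels and dog_data are illustrative only.

```python
import numpy as np
import pandas as pd

animal_label = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [np.array((x,) * 3) for x in [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]]
df = pd.DataFrame(dict(animal_label=animal_label, data=data))

have_dog = df['animal_label'] == 'dog'
selected = df[have_dog]

# Labels come back as a 1-D array of the column's values.
dog_labels = selected['animal_label'].to_numpy()

# Each cell holds one array; stacking them restores a regular numeric array.
dog_data = np.stack(selected['data'].to_list())

print(dog_data.shape)  # (3, 3)
print(dog_data.dtype)  # float64
```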

Upvotes: 1

rammelmueller

Reputation: 1118

If I understand your problem correctly, this would be done by slicing like this:

selector = 'fish'
matching_labels = labels[labels==selector]
matching_data = data[labels==selector]

Alternatively, you could use the approach from the answer to your previous question and convert the list alist to a numpy array with alist = numpy.array(alist).
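
Here is a minimal sketch of the boolean-mask approach applied to integer labels and a 4-D image array, as in the updated question. The synthetic arrays are assumptions standing in for full_labels and full_images, and it assumes all images share one shape so that the image array is a regular numeric array.

```python
import numpy as np

# Synthetic stand-ins (assumption): 6 samples of 4x5 RGB images, integer labels.
full_labels = np.array([0, 0, 1, 1, 1, 2])
full_images = np.random.default_rng(0).random((6, 4, 5, 3))

mask = full_labels == 1          # boolean array, one entry per sample
test_labels = full_labels[mask]  # 1-D, same dtype as full_labels
test_images = full_images[mask]  # selects along the first axis only

print(test_labels)                             # [1 1 1]
print(test_images.shape)                       # (3, 4, 5, 3)
print(test_images.dtype == full_images.dtype)  # True
```

Computing the mask once and reusing it keeps the two selections aligned, and both results remain numpy arrays with the dtypes of the originals.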

Upvotes: 0