Reputation: 3728

How to "slice" a pair of lists based on values in one of them

I have two lists of equal length, one containing labels and the other data. For example:

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe', ...]
data = [ 0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8, ... ]

How can I extract sub-lists of both lists in parallel based on a particular label in the labels list?

For example, using fish as the selection criterion, I want to generate:

selected_labels = [ 'fish', 'fish' ]
selected_data = [ 0.3, 0.2 ]

My best guess sounds cumbersome - make a list of element-wise tuples, extract a list of relevant tuples from that list, then de-tuple that list of tuples back into two lists of single elements. Even if that's the way to approach it, I'm too new to Python to stumble on the syntax for that.

Upvotes: 2

Answers (5)

juanpa.arrivillaga

Reputation: 96172

The simplest approach is totally fine here, and likely very performant:

>>> selected_labels, selected_data  = [], []
>>> for l, d in zip(labels, data):
...     if l == 'fish':
...         selected_labels.append(l)
...         selected_data.append(d)
...
>>> selected_labels
['fish', 'fish']
>>> selected_data
[0.3, 0.2]

Here are some timings. I didn't have time to include every approach so far, but here are a few:

>>> labels*=5000
>>> data *= 5000
>>> def juan(data, labels, target):
...     selected_labels, selected_data  = [], []
...     for l, d in zip(labels, data):
...         if l == target:
...             selected_labels.append(l)
...             selected_data.append(d)
...     return selected_labels, selected_data
...
>>> def stephen_rauch(data, labels, target):
...     tuples = (x for x in zip(labels, data) if x[0] == target)
...     selected_labels, selected_data = map(list, zip(*tuples))
...     return selected_labels, selected_data
...
>>> from itertools import compress
>>>
>>> def brad_solomon(data, labels, target):
...     selected_data = list(compress(data, (i==target for i in labels)))
...     selected_labels = [target] * len(selected_data)
...     return selected_data, selected_labels
...
>>> import timeit
>>> setup = "from __main__ import data, labels, juan, stephen_rauch, brad_solomon"
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
3.1627789690101054
>>> timeit.timeit("stephen_rauch(data,labels,'fish')", setup, number=1000)
3.8860850729979575
>>> timeit.timeit("brad_solomon(data,labels,'fish')", setup, number=1000)
2.7442518350144383

I would say relying on itertools.compress does just fine. I was worried that building selected_labels = ['fish'] * len(selected_data) would slow it down, but that expression can be highly optimized in Python (the size of the list is known ahead of time, and it simply repeats the same pointer). Finally, note that the simple, naive approach I gave can be optimized by "caching" the .append method:

>>> def juan(data, labels, target):
...     selected_labels, selected_data  = [], []
...     append_label = selected_labels.append
...     append_data = selected_data.append
...     for l, d in zip(labels, data):
...         if l == target:
...             append_label(l)
...             append_data(d)
...     return selected_labels, selected_data
...
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
2.577823764993809

Upvotes: 2

Alan Hoover

Reputation: 1450

As an alternative to the zip answer, you might consider using a different data structure. I would put the data in a dict:

data = {'cat' : [0.3, 0.1],
        'dog' : [0.9, 0.5, 0.4],
        'fish' : [0.3, 0.2],
        'giraffe' : [0.8],
        # ...
        }

Then, to access the values, data['fish'] gives [0.3, 0.2].

You can load the data you have into such a dict by doing this one time only:

data2 = {}
for label, datum in zip(labels,data):
    if label not in data2:
        data2[label] = []
    data2[label].append(datum)

Then just do this for each query:

select = 'fish'
selected_data = data2[select]
selected_labels = [select] * len(selected_data)
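
The one-time grouping loop can also be written with collections.defaultdict, which creates the empty list the first time a label is seen. This is only a minimal sketch using the sample lists from the question; the name grouped is illustrative and not part of the original answer.

```python
from collections import defaultdict

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# Group the data values by label in a single pass.
grouped = defaultdict(list)
for label, datum in zip(labels, data):
    grouped[label].append(datum)

select = 'fish'
# .get avoids inserting an empty entry for a label that never occurred.
selected_data = grouped.get(select, [])
selected_labels = [select] * len(selected_data)

print(selected_labels)  # ['fish', 'fish']
print(selected_data)    # [0.3, 0.2]
```

Using .get with a default means that querying a label that is not present returns an empty list instead of raising KeyError or adding a new key.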

Upvotes: 0

Brad Solomon

Reputation: 40918

This might be a good place to apply itertools.compress, which is slightly faster than zip, at least for the size of data structures you're working with.

from itertools import compress

selected_data = list(compress(data, (i=='fish' for i in labels)))
selected_labels = ['fish'] * len(selected_data)

Usage:

compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F

Timing:

def with_compress():
    selected_data = list(compress(data, (i=='fish' for i in labels)))
    selected_labels = ['fish'] * len(selected_data)
    return selected_data, selected_labels

def with_zip():
    tuples = (x for x in zip(labels, data) if x[0] == 'fish')
    selected_labels, selected_data = map(list, zip(*tuples))
    return selected_data, selected_labels

%timeit -r 7 -n 100000 with_compress()
3.82 µs ± 96.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit -r 7 -n 100000 with_zip()
4.67 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

(i=='fish' for i in labels) is a generator of True and False. compress filters down data element-wise to cases where True occurs.

From the docstring:

Roughly equivalent to:

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F
    return (d for d, s in zip(data, selectors) if s)
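
If you would rather filter both lists with compress instead of rebuilding the labels by repetition, the selectors can be built once and reused. A minimal sketch, assuming the sample lists from the question; the names mask and target are illustrative. The mask is a list rather than a generator here, because a generator would be exhausted after the first compress call.

```python
from itertools import compress

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]
target = 'fish'

# Build the boolean mask once as a list so it can be consumed twice.
mask = [label == target for label in labels]

selected_labels = list(compress(labels, mask))
selected_data = list(compress(data, mask))

print(selected_labels)  # ['fish', 'fish']
print(selected_data)    # [0.3, 0.2]
```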

Upvotes: 3

Stephen Rauch

Reputation: 49812

Using zip() and a generator expression this can be done like:

Code:

tuples = (x for x in zip(labels, data) if x[0] == 'fish')
selected_labels, selected_data = map(list, zip(*tuples))

How does this work?

The tuples line builds a generator expression which zips the two lists together and drops every pair whose label does not match. The second line uses zip again to transpose the remaining pairs and then maps the resulting tuples into lists as desired.

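The filtering step is lazy, so no intermediate list of all the pairs is built, which should keep this fairly fast and memory efficient. Note that zip(*tuples) still has to unpack every matching pair before it can transpose them.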

Test Code:

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

tuples = (x for x in zip(labels, data) if x[0] == 'fish')
selected_labels, selected_data = map(list, zip(*tuples))

print(selected_labels)
print(selected_data)

Results:

['fish', 'fish']
[0.3, 0.2]
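
One edge case worth knowing about: if no label matches, zip(*tuples) yields nothing and the two-name unpacking raises ValueError. A minimal sketch of a guarded version follows; the function name select_pairs is hypothetical, and it materializes the matching pairs in a list so that emptiness can be checked first.

```python
def select_pairs(labels, data, target):
    """Return (selected_labels, selected_data) for pairs whose label equals target."""
    # Collect the matching (label, datum) pairs so we can test for emptiness.
    pairs = [(l, d) for l, d in zip(labels, data) if l == target]
    if not pairs:
        # zip(*pairs) would produce nothing to unpack, so return empty lists.
        return [], []
    selected_labels, selected_data = map(list, zip(*pairs))
    return selected_labels, selected_data


labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

print(select_pairs(labels, data, 'fish'))   # (['fish', 'fish'], [0.3, 0.2])
print(select_pairs(labels, data, 'zebra'))  # ([], [])
```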

Upvotes: 4

Sohaib Farooqi

Reputation: 5666

You can zip the lists together, filter them based on the keyword you are looking for, and then unzip. In Python 3, wrap the result in list() so that it can be indexed:

>>> items = list(zip(*filter(lambda x: x[0] == "fish", zip(labels, data))))
>>> items
[('fish', 'fish'), (0.3, 0.2)]

Then your selected_data and selected_labels would be:

>>> selected_data = list(items[1])
>>> selected_labels = list(items[0])

Another alternative is to use the map function to get the desired format:

>>> items = list(map(list, zip(*filter(lambda x: x[0] == "fish", zip(labels, data)))))
>>> items
[['fish', 'fish'], [0.3, 0.2]]

Upvotes: 2
