Reputation: 3728
I have two lists of equal length, one containing labels and the other data. For example:
labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe', ...]
data = [ 0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8, ... ]
How can I extract sub-lists of both lists in parallel based on a particular label in the labels list?
For example, using fish as the selection criterion, I want to generate:
selected_labels = [ 'fish', 'fish' ]
selected_data = [ 0.3, 0.2 ]
My best guess sounds cumbersome - make a list of element-wise tuples, extract a list of relevant tuples from that list, then de-tuple that list of tuples back into two lists of single elements. Even if that's the way to approach it, I'm too new to Python to stumble on the syntax for that.
Upvotes: 2
Views: 443
Reputation: 96172
The simplest approach is totally fine here, and likely very performant:
>>> selected_labels, selected_data = [], []
>>> for l, d in zip(labels, data):
...     if l == 'fish':
...         selected_labels.append(l)
...         selected_data.append(d)
...
>>> selected_labels
['fish', 'fish']
>>> selected_data
[0.3, 0.2]
Here are some timings. I didn't have time to include every approach posted so far, but here are a few:
>>> labels *= 5000
>>> data *= 5000
>>> def juan(data, labels, target):
...     selected_labels, selected_data = [], []
...     for l, d in zip(labels, data):
...         if l == target:
...             selected_labels.append(l)
...             selected_data.append(d)
...     return selected_labels, selected_data
...
>>> def stephen_rauch(data, labels, target):
...     tuples = (x for x in zip(labels, data) if x[0] == target)
...     selected_labels, selected_data = map(list, zip(*tuples))
...     return selected_labels, selected_data
...
>>> from itertools import compress
>>>
>>> def brad_solomon(data, labels, target):
...     selected_data = list(compress(data, (i == target for i in labels)))
...     # use the target argument rather than a hard-coded 'fish'
...     selected_labels = [target] * len(selected_data)
...     return selected_data, selected_labels
...
>>> import timeit
>>> setup = "from __main__ import data, labels, juan, stephen_rauch, brad_solomon"
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
3.1627789690101054
>>> timeit.timeit("stephen_rauch(data,labels,'fish')", setup, number=1000)
3.8860850729979575
>>> timeit.timeit("brad_solomon(data,labels,'fish')", setup, number=1000)
2.7442518350144383
I would say that relying on itertools.compress does just fine. I was worried that having to do selected_labels = ['fish'] * len(selected_data) would slow it down, but it is an expression that can be highly optimized in Python (the size of the list is known ahead of time, and it simply repeats the same pointer). Finally, note that the simple, naive approach I gave can be optimized by "caching" the .append method:
>>> def juan(data, labels, target):
...     selected_labels, selected_data = [], []
...     append_label = selected_labels.append
...     append_data = selected_data.append
...     for l, d in zip(labels, data):
...         if l == target:
...             append_label(l)
...             append_data(d)
...     return selected_labels, selected_data
...
>>> timeit.timeit("juan(data,labels,'fish')", setup, number=1000)
2.577823764993809
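The remark above that ['fish'] * len(selected_data) simply repeats the same pointer can be checked directly. This is only a minimal sketch; the variable names are illustrative and the sample values are taken from the question.

```python
# List repetition copies references, not objects: every slot of the
# result points at the same string object held by the one-element list.
target = 'fish'
selected_data = [0.3, 0.2]

selected_labels = [target] * len(selected_data)

print(selected_labels)  # ['fish', 'fish']
print(all(label is target for label in selected_labels))  # True
```

Sharing one object this way is harmless for immutable labels such as strings; it would only matter if the repeated element were mutable.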
Upvotes: 2
Reputation: 1450
As an alternative to the zip answer, you might consider using a different data structure. I would put that in a dict:
data = {'cat' : [0.3, 0.1],
        'dog' : [0.9, 0.5, 0.4],
        'fish' : [0.3, 0.2],
        'giraffe' : [0.8],
        # ...
        }
Then to access, just data['fish'] will give [0.3, 0.2].
You can load the data you have into such a dict by doing this one time only:
data2 = {}
for label, datum in zip(labels, data):
    if label not in data2:
        data2[label] = []
    data2[label].append(datum)
Then just do this for each query
select = 'fish'
selected_data = data2[select]
selected_labels = [select] * len(selected_data)
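The one-time loading step can also be written with collections.defaultdict, which drops the explicit membership check. This is a sketch of the same idea, not a different method; the name grouped is illustrative, and the sample lists are the ones from the question.

```python
from collections import defaultdict

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# defaultdict(list) creates an empty list the first time a label is seen,
# so no "if label not in ..." check is needed.
grouped = defaultdict(list)
for label, datum in zip(labels, data):
    grouped[label].append(datum)

select = 'fish'
# .get avoids inserting a new empty entry when the label is absent.
selected_data = grouped.get(select, [])
selected_labels = [select] * len(selected_data)

print(selected_labels)  # ['fish', 'fish']
print(selected_data)    # [0.3, 0.2]
```

Building the dict costs one pass over the lists; after that, each query is a single lookup rather than a full scan.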
Upvotes: 0
Reputation: 40918
This might be a good place to apply itertools.compress, which is slightly faster than zip, at least for the size of data structures you're working with.
from itertools import compress
selected_data = list(compress(data, (i=='fish' for i in labels)))
selected_labels = ['fish'] * len(selected_data)
Usage:
compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F
Timing:
def with_compress():
    selected_data = list(compress(data, (i=='fish' for i in labels)))
    selected_labels = ['fish'] * len(selected_data)
    return selected_data, selected_labels
def with_zip():
    tuples = (x for x in zip(labels, data) if x[0] == 'fish')
    selected_labels, selected_data = map(list, zip(*tuples))
    return selected_data, selected_labels
%timeit -r 7 -n 100000 with_compress()
3.82 µs ± 96.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit -r 7 -n 100000 with_zip()
4.67 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
(i=='fish' for i in labels) is a generator that yields True or False for each label. compress filters down data element-wise to the positions where True occurs.
From the docstring:
Roughly equivalent to:
def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> A C E F
    return (d for d, s in zip(data, selectors) if s)
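To make the selector idea concrete, here is a small sketch that materializes the True/False sequence as a list before passing it to compress. The name selectors is illustrative; in the answer's own code the same values are produced lazily by a generator.

```python
from itertools import compress

labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

# One boolean per label; True marks the positions to keep.
selectors = [label == 'fish' for label in labels]
print(selectors)
# [False, False, False, False, False, True, True, False]

# compress keeps the items of data whose selector is truthy.
selected_data = list(compress(data, selectors))
print(selected_data)  # [0.3, 0.2]
```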
Upvotes: 3
Reputation: 49812
Using zip() and a generator expression, this can be done like:
tuples = (x for x in zip(labels, data) if x[0] == 'fish')
selected_labels, selected_data = map(list, zip(*tuples))
The tuples line builds a generator expression which zips the two lists together and drops anything that is uninteresting. The second line uses zip again and then maps the resulting tuples into lists as desired.
This has the advantage of not building an intermediate list of all the pairs, so it should be fairly fast and memory efficient.
labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]
tuples = (x for x in zip(labels, data) if x[0] == 'fish')
selected_labels, selected_data = map(list, zip(*tuples))
print(selected_labels)
print(selected_data)
['fish', 'fish']
[0.3, 0.2]
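One edge case worth keeping in mind: if no label matches, zip(*tuples) produces nothing, and unpacking into two names raises ValueError. The sketch below guards against that. The function name select_by_label is hypothetical, and unlike the generator version it builds a list of the matching pairs so that it can test for emptiness.

```python
def select_by_label(labels, data, target):
    """Return (selected_labels, selected_data) for entries matching target."""
    pairs = [pair for pair in zip(labels, data) if pair[0] == target]
    if not pairs:
        # zip(*[]) yields nothing, so unpacking into two names would
        # raise ValueError; return two empty lists instead.
        return [], []
    selected_labels, selected_data = map(list, zip(*pairs))
    return selected_labels, selected_data


labels = ['cat', 'cat', 'dog', 'dog', 'dog', 'fish', 'fish', 'giraffe']
data = [0.3, 0.1, 0.9, 0.5, 0.4, 0.3, 0.2, 0.8]

print(select_by_label(labels, data, 'fish'))   # (['fish', 'fish'], [0.3, 0.2])
print(select_by_label(labels, data, 'zebra'))  # ([], [])
```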
Upvotes: 4
Reputation: 5666
You can zip the lists together, filter them based on the keyword you are looking for, and then unzip. In Python 3, zip returns an iterator, so wrap it in list to be able to index the result:
>>> items = list(zip(*filter(lambda x: x[0] == "fish", zip(labels, data))))
>>> items
[('fish', 'fish'), (0.3, 0.2)]
Then your selected_data and selected_labels would be:
>>> selected_data = list(items[1])
>>> selected_labels = list(items[0])
Another alternative is to use the map function to get the desired format:
>>> items = map(list, zip(*filter(lambda x: x[0] == "fish", zip(labels, data))))
>>> list(items)
[['fish', 'fish'], [0.3, 0.2]]
Upvotes: 2