Reputation: 2640
If I have a list of strings, such as:
lst = ['aa bb', 'cc dd', 'cc aa']
How can I get this into a list of unique words such as this:
['aa', 'bb', 'cc', 'dd']
using a comprehension? Here's as far as I've gotten, to no avail:
wordList = [x.split() for row in lst for x in row]
Upvotes: 0
Views: 145
Reputation: 8620
In [25]: list({y for x in lst for y in x.split()})
Out[25]: ['aa', 'cc', 'dd', 'bb']
To preserve order while removing duplicates from a list, you can refer to http://www.peterbe.com/plog/uniqifiers-benchmark.
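As a minimal sketch of one order-preserving option (not taken from the linked benchmark), assuming Python 3.7 or later, where plain dicts are guaranteed to keep insertion order:

```python
lst = ['aa bb', 'cc dd', 'cc aa']

# dict.fromkeys keeps only the first occurrence of each key. Since Python 3.7
# dicts preserve insertion order, so the keys come back in first-seen order.
unique_words = list(dict.fromkeys(word for row in lst for word in row.split()))
print(unique_words)  # ['aa', 'bb', 'cc', 'dd']
```

On older Python versions this ordering is not guaranteed, so treat the version requirement as part of the sketch.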
Upvotes: 1
Reputation: 437
I think the simplest approach is probably this, although it is not the most efficient:
set(' '.join(lst).split())
If you really want a list, just wrap that in a call to list().
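Because a set has no defined order, here is a small sketch that sorts the result, which happens to match the output asked for in the question. It assumes alphabetical order is acceptable, which the question does not actually state:

```python
lst = ['aa bb', 'cc dd', 'cc aa']

# Join all rows into one string, split on whitespace, dedupe with a set,
# then sort so the resulting list is deterministic.
words = sorted(set(' '.join(lst).split()))
print(words)  # ['aa', 'bb', 'cc', 'dd']
```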
Upvotes: 1
Reputation: 1125058
You want to loop over the split values:
wordList = [word for row in lst for word in row.split()]
then use a set to make the whole list unique:
wordList = list({word for row in lst for word in row.split()})
or just use a set and be done with it:
wordList = {word for row in lst for word in row.split()}
Demo:
>>> lst = ['aa bb', 'cc dd', 'cc aa']
>>> list({word for row in lst for word in row.split()})
['aa', 'cc', 'dd', 'bb']
>>> {word for row in lst for word in row.split()}
set(['aa', 'cc', 'dd', 'bb'])
If order matters (the above code returns words in arbitrary order; any apparent ordering is a coincidence of CPython implementation details), use a separate set to track values already seen:
seen = set()
wordList = [word for row in lst for word in row.split() if word not in seen and not seen.add(word)]
To illustrate the difference, a better input sample:
>>> lst = ['the quick brown fox', 'brown speckled hen', 'the hen and the fox']
>>> seen = set()
>>> [word for row in lst for word in row.split() if word not in seen and not seen.add(word)]
['the', 'quick', 'brown', 'fox', 'speckled', 'hen', 'and']
>>> {word for row in lst for word in row.split()}
set(['and', 'brown', 'fox', 'speckled', 'quick', 'the', 'hen'])
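The `not seen.add(word)` trick works because `set.add()` always returns `None`, so that half of the condition is always true and is there only for its side effect. For anyone who finds that opaque, here is a plain-loop sketch of the same logic (the function name `unique_in_order` is made up for illustration):

```python
def unique_in_order(rows):
    """Return each whitespace-separated word once, in first-seen order."""
    seen = set()
    result = []
    for row in rows:
        for word in row.split():
            if word not in seen:
                seen.add(word)       # remember the word so repeats are skipped
                result.append(word)
    return result


lst = ['the quick brown fox', 'brown speckled hen', 'the hen and the fox']
print(unique_in_order(lst))
# ['the', 'quick', 'brown', 'fox', 'speckled', 'hen', 'and']
```

The loop is longer than the comprehension, but it avoids relying on a side effect inside a condition.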
Upvotes: 2
Reputation: 60024
For keeping order, you can do something like:
>>> from collections import OrderedDict
>>> lst = ['aa bb', 'cc dd', 'cc aa']
>>> new = []
>>> for i in lst:
... new.extend(i.split())
...
>>> list(OrderedDict.fromkeys(new))
['aa', 'bb', 'cc', 'dd']
Note that using a set() is most likely faster, as Martijn has pointed out.
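If you want to check the speed difference yourself, a rough timing sketch might look like the following. The function names and the `compare` helper are hypothetical, and the numbers will vary with Python version and input size, so no result is claimed here:

```python
import timeit
from collections import OrderedDict

lst = ['the quick brown fox', 'brown speckled hen', 'the hen and the fox']


def with_seen_set(rows):
    # Order-preserving dedupe using a helper set to track seen words.
    seen = set()
    return [w for row in rows for w in row.split()
            if w not in seen and not seen.add(w)]


def with_ordereddict(rows):
    # Order-preserving dedupe by flattening first, then OrderedDict.fromkeys.
    words = []
    for row in rows:
        words.extend(row.split())
    return list(OrderedDict.fromkeys(words))


def compare(number=10000):
    """Return the seconds each approach takes for `number` runs."""
    return {
        'seen_set': timeit.timeit(lambda: with_seen_set(lst), number=number),
        'ordereddict': timeit.timeit(lambda: with_ordereddict(lst), number=number),
    }


print(compare())
```

Both functions return the same list, so the comparison is like for like; only the timing differs.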
Upvotes: 1