StatsViaCsh

Reputation: 2640

How do I create a list of words from a list of sentences?

If I have a list of strings, such as:

lst =  ['aa bb', 'cc dd', 'cc aa']

How can I get this into a list of unique words such as this:

['aa', 'bb', 'cc', 'dd']

using a comprehension? Here's as far as I've gotten, to no avail:

wordList = [x.split() for row in lst for x in row]

Upvotes: 0

Views: 145

Answers (4)

zhangyangyu

Reputation: 8620

In [25]: list({y for x in lst for y in x.split()})
Out[25]: ['aa', 'cc', 'dd', 'bb']

A set does not preserve order. To remove duplicates while keeping the original order, you can refer to http://www.peterbe.com/plog/uniqifiers-benchmark.
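As a minimal sketch of one order-preserving option (my own assumption, not taken from the linked benchmark): on Python 3.7 and later, plain dicts keep insertion order, so dict.fromkeys can drop duplicates while keeping each word where it first appeared.

```python
lst = ['aa bb', 'cc dd', 'cc aa']

# Dict keys are unique and (on Python 3.7+) keep insertion order,
# so each word survives once, in the position where it was first seen.
ordered_unique = list(dict.fromkeys(y for x in lst for y in x.split()))
print(ordered_unique)  # ['aa', 'bb', 'cc', 'dd']
```

On older Python versions this trick does not guarantee order, so something like collections.OrderedDict would be needed instead.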

Upvotes: 1

RussW

Reputation: 437

I think the simplest approach is probably this, although it is not the most efficient:

set(' '.join(lst).split())

If you really want a list, just wrap that in a call to list().
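A short sketch of that idea, assuming a sorted result is acceptable (the sorted() call is my addition; a set on its own has no defined order):

```python
lst = ['aa bb', 'cc dd', 'cc aa']

# Join every sentence into one string, split on whitespace, then deduplicate.
words = set(' '.join(lst).split())

# sorted() accepts any iterable and returns a list in a deterministic order.
word_list = sorted(words)
print(word_list)  # ['aa', 'bb', 'cc', 'dd']
```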

Upvotes: 1

Martijn Pieters

Reputation: 1125058

Your attempt iterates over the characters of each row and splits each character. You want to loop over the split values instead:

wordList = [word for row in lst for word in row.split()]

then use a set to make the whole list unique:

wordList = list({word for row in lst for word in row.split()})

or just use a set and be done with it:

wordList = {word for row in lst for word in row.split()}

Demo:

>>> lst =  ['aa bb', 'cc dd', 'cc aa']
>>> list({word for row in lst for word in row.split()})
['aa', 'cc', 'dd', 'bb']
>>> {word for row in lst for word in row.split()}
set(['aa', 'cc', 'dd', 'bb'])

If order matters, use a separate set to track duplicate values. The code above returns words in arbitrary order; any ordering you see in the output is a coincidence of CPython implementation details.

seen = set()
wordList = [word for row in lst for word in row.split() if word not in seen and not seen.add(word)]

To illustrate the difference, a better input sample:

>>> lst = ['the quick brown fox', 'brown speckled hen', 'the hen and the fox']
>>> seen = set()
>>> [word for row in lst for word in row.split() if word not in seen and not seen.add(word)]
['the', 'quick', 'brown', 'fox', 'speckled', 'hen', 'and']
>>> {word for row in lst for word in row.split()}
set(['and', 'brown', 'fox', 'speckled', 'quick', 'the', 'hen'])
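The comprehension works because set.add() returns None, so `not seen.add(word)` is always true and is there only for its side effect. If that reads as too clever, the same logic can be sketched as a generator function (the name unique_words is hypothetical, chosen just for this example):

```python
def unique_words(rows):
    """Yield each whitespace-separated word once, in first-seen order."""
    seen = set()
    for row in rows:
        for word in row.split():
            if word not in seen:
                seen.add(word)  # remember the word so later repeats are skipped
                yield word


lst = ['the quick brown fox', 'brown speckled hen', 'the hen and the fox']
result = list(unique_words(lst))
print(result)
# ['the', 'quick', 'brown', 'fox', 'speckled', 'hen', 'and']
```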

Upvotes: 2

TerryA

Reputation: 60024

For keeping order, you can do something like:

>>> from collections import OrderedDict
>>> lst =  ['aa bb', 'cc dd', 'cc aa']
>>> new = []
>>> for i in lst:
...     new.extend(i.split())
...
>>> list(OrderedDict.fromkeys(new))
['aa', 'bb', 'cc', 'dd']

Note that using a set() is most likely faster, as Martijn has pointed out.
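If you want to check the speed difference on your own machine, here is a rough timing sketch. The function names with_set and with_ordered_dict and the input size are assumptions made for illustration, and the numbers will vary by Python version and hardware.

```python
import timeit
from collections import OrderedDict

# Repeat the sample input so there is enough work to measure.
lst = ['aa bb', 'cc dd', 'cc aa'] * 1000


def with_set(rows):
    # Unordered: returns a set of the unique words.
    return {word for row in rows for word in row.split()}


def with_ordered_dict(rows):
    # Ordered: keeps each word in first-seen order.
    return list(OrderedDict.fromkeys(word for row in rows for word in row.split()))


set_time = timeit.timeit(lambda: with_set(lst), number=100)
od_time = timeit.timeit(lambda: with_ordered_dict(lst), number=100)
print("set:         %.4f s" % set_time)
print("OrderedDict: %.4f s" % od_time)
```

Keep in mind the two are not interchangeable: the set version gives up ordering.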

Upvotes: 1
