Reputation: 967
I have a list of strings as the following one:
a = ['aaa-t1', 'aaa-t2', 'aab-t1', 'aab-t2', 'aab-t3', 'abc-t2']
I would like to cluster those strings by similarity. As you may note, a[0] and a[1] share the same root: aaa. I would like to produce a new list of lists that looks like this:
b = [['aaa-t1', 'aaa-t2'], ['aab-t1', 'aab-t2', 'aab-t3'], ['abc-t2']]
What would be a way to do this? So far I have not succeeded, and I don't have any decent code to show. I tried comparing the strings with fuzzywuzzy, but doing so requires building the possible combinations of strings, which scales badly with the list's length.
Upvotes: 2
Views: 686
Reputation: 17263
You can use groupby to group the strings by a key generated with str.split:
>>> from itertools import groupby
>>> a = ['aaa-t1', 'aaa-t2', 'aab-t1', 'aab-t2', 'aab-t3', 'abc-t2']
>>> [list(g) for k, g in groupby(sorted(a), lambda x: x.split('-', 1)[0])]
[['aaa-t1', 'aaa-t2'], ['aab-t1', 'aab-t2', 'aab-t3'], ['abc-t2']]
groupby returns an iterator of (key, group) tuples, where key is the key used for grouping and group is an iterable of the items in that group. The first parameter given to groupby is the iterable to produce groups from, and the optional second parameter is a key function that is called on each item to produce its key. Since groupby only groups consecutive elements, a needs to be sorted first.
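To see why the sort matters, here is a minimal sketch that runs groupby on an unsorted list, with and without sorting. The helper names root and cluster and the sample list are made up for illustration; they are not part of the question.

```python
from itertools import groupby

def root(s):
    # Key function: everything before the first '-'
    return s.split('-', 1)[0]

def cluster(strings):
    # Sort by the same key so that equal roots become adjacent,
    # then let groupby collect each run of equal keys.
    return [list(g) for _, g in groupby(sorted(strings, key=root), key=root)]

unsorted = ['aab-t1', 'aaa-t1', 'aab-t2', 'aaa-t2']

# Without sorting, equal roots are not adjacent, so every item
# ends up in its own group.
without_sort = [list(g) for _, g in groupby(unsorted, key=root)]
print(without_sort)  # [['aab-t1'], ['aaa-t1'], ['aab-t2'], ['aaa-t2']]

# With sorting, items sharing a root are grouped together.
with_sort = cluster(unsorted)
print(with_sort)     # [['aaa-t1', 'aaa-t2'], ['aab-t1', 'aab-t2']]
```

Sorting by the key function rather than by the whole string keeps the original relative order inside each group, because Python's sort is stable.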
Upvotes: 6