Clustering the strings in a list and returning a list of lists

Question

I have a list of strings like the following:

a = ['aaa-t1', 'aaa-t2', 'aab-t1', 'aab-t2', 'aab-t3', 'abc-t2']

I would like to cluster those strings by similarity. As you may note, a[0] and a[1] share the same root: aaa. I would like to produce a new list of lists that looks like this:

b = [['aaa-t1', 'aaa-t2'], ['aab-t1', 'aab-t2', 'aab-t3'], ['abc-t2']]

What would be a way to do so? So far I have not succeeded, and I don't have any decent code to show. I tried comparing strings with fuzzywuzzy, but that requires generating every possible pair of strings, which scales badly with the length of the list.

niemmi · Accepted Answer

You can use groupby to group the strings by a key generated with str.split:

>>> from itertools import groupby
>>> a = ['aaa-t1', 'aaa-t2', 'aab-t1', 'aab-t2', 'aab-t3', 'abc-t2']
>>> [list(g) for k, g in groupby(sorted(a), lambda x: x.split('-', 1)[0])]
[['aaa-t1', 'aaa-t2'], ['aab-t1', 'aab-t2', 'aab-t3'], ['abc-t2']]

groupby returns an iterator of (key, group) tuples, where key is the value used for grouping and group is an iterator over the items in that group. The first argument to groupby is the iterable to produce groups from, and the optional second argument is a key function that is called on each element to produce its key. Since groupby only groups consecutive elements, a needs to be sorted first.
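
If sorting is undesirable (for example, when the original order of the list matters), one possible alternative is to collect the groups in a dictionary keyed by the same prefix. The following is only a minimal sketch: the helper name group_by_prefix and its sep parameter are made up for illustration and are not part of the answer above, and it assumes every string uses the same separator.

```python
from collections import defaultdict


def group_by_prefix(strings, sep="-"):
    """Group strings by the text before the first `sep` (hypothetical helper)."""
    groups = defaultdict(list)
    for s in strings:
        # Same key as in the groupby version: everything before the first separator.
        groups[s.split(sep, 1)[0]].append(s)
    # Dicts preserve insertion order (Python 3.7+), so the groups come out in
    # the order in which each prefix first appeared in the input.
    return list(groups.values())


a = ['aab-t1', 'aaa-t1', 'aab-t2', 'abc-t2', 'aaa-t2']  # deliberately unsorted
print(group_by_prefix(a))
# [['aab-t1', 'aab-t2'], ['aaa-t1', 'aaa-t2'], ['abc-t2']]
```

This makes a single pass over the list and does not need the input to be sorted, at the cost of holding all groups in memory at once. Unlike the sorted groupby version, the groups appear in first-seen order rather than alphabetical order.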
