J.P.
J.P.

Reputation: 141

Identifying groups of similar numbers in a list

I have lists of numbers that I'd like to group by similarity. The order of the numbers in the list is fixed and important to preserve.

As an example, here's a visualisation of what I'm trying to achieve:

Black line is the list of numbers, green lines are the identified groups of similar numbers I'd like to identify, corresponding with that section of the list.

The black line represents the list of numbers I have. The green lines represent the groupings I would like to identify in this example list.

The order of numbers in the list is important and cannot be changed (e.g. cannot sort ascending or descending). The numbers in the list are not contiguous (i.e. there isn't likely to be a list of 6, 6, 6, 6, but probably would be something like 5.85, 6.1, 5.96, 5.88).

Is there a method to do this?

Edit: example values, and desired groupings:

[4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

would result in an approximate grouping of

[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]

In the grouping above, you could argue that 4.52 could belong to the first or second group. If visualised as I did in the example above, the groupings would be represented by the green lines. My lists are actually several hundred to several thousand values in length.

Upvotes: 2

Views: 927

Answers (5)

user5722782
user5722782

Reputation: 11

https://en.wikipedia.org/wiki/K-means_clustering k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

Upvotes: 1

Padraic Cunningham
Padraic Cunningham

Reputation: 180411

A numpy version:

l = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

import numpy as np

x = np.array(l)
mask = np.diff(np.round(x))
print(np.split(x, np.where(mask)[0] + 1))
[array([ 4.1 ,  4.05,  4.14,  4.01,  3.97]), array([ 4.52,  4.97,  5.02,  5.05,  5.2 ,  5.18]), array([ 3.66,  3.77,  3.59,  3.72])]

Or:

import numpy as np

diff = .5
x = np.array(l)
mask = np.abs(x[:-1] - x[1:]) <= diff
print(np.split(x, np.where(~mask)[0] + 1)
[array([ 4.1 ,  4.05,  4.14,  4.01,  3.97]), array([ 4.52,  4.97,  5.02,  5.05,  5.2 ,  5.18]), array([ 3.66,  3.77,  3.59,  3.72])]

Upvotes: 2

Gary van der Merwe
Gary van der Merwe

Reputation: 9533

from statistics import mean

def ordered_cluster(data, max_diff):
    current_group = ()
    for item in data:
        test_group = current_group + (item, )
        test_group_mean = mean(test_group)
        if all((abs(test_group_mean - test_item) < max_diff for test_item in test_group)):
            current_group = test_group
        else:
            yield current_group
            current_group = (item, )
    if current_group:
        yield current_group

data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

print(list(ordered_cluster(data, 0.5)))

Output :

[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]

This ensures that each item from a group does not exceed max_diff to the mean of the group. If it does, a new group is started.

Upvotes: 3

Kasravnd
Kasravnd

Reputation: 107287

You can use itertools.groupby to categorize your data based on a specific difference (2 in this case) with those preceding item.

from itertools import groupby, chain
from collections import OrderedDict

def grouper(_lst, interval):
    z = zip(_lst,_lst[1:])
    return [OrderedDict.fromkeys(chain.from_iterable(g)).keys() for k,g in groupby(z,key=lambda x:x[1]-x[0]<interval) if k]

Here I used OrderedDict.fromkeys in order to preserver the unique items in a specific order.

Demo :

test = [0, 1.3, 2.2, 2.9, 6, 7.8, 8, 9.1, 10.4,15, 16, 17.6, 17.7, 18.9]
print(grouper(test, 2))
[[0, 1.3, 2.2, 2.9], [6, 7.8, 8, 9.1, 10.4], [15, 16, 17.6, 17.7, 18.9]]

Upvotes: 2

awesoon
awesoon

Reputation: 33671

You may use itertools.groupby - it combines consecutive elements with same result of given key function (round in this case):

In [7]: import itertools

In [8]: data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]

In [9]: [tuple(xs) for _, xs in itertools.groupby(data, round)]
Out[9]: 
[(4.1, 4.05, 4.14, 4.01, 3.97),
 (4.52, 4.97, 5.02, 5.05, 5.2, 5.18),
 (3.66, 3.77, 3.59, 3.72)]

Upvotes: 4

Related Questions