Reputation: 141
I have lists of numbers that I'd like to group by similarity. The order of the numbers in the list is fixed and important to preserve.
As an example, here's a visualisation of what I'm trying to achieve:
The black line represents the list of numbers I have. The green lines represent the groupings I would like to identify in this example list.
The order of numbers in the list is important and cannot be changed (e.g. cannot sort ascending or descending). The numbers in the list are not contiguous (i.e. there isn't likely to be a list of 6, 6, 6, 6
, but probably would be something like 5.85, 6.1, 5.96, 5.88
).
Is there a method to do this?
Edit: example values, and desired groupings:
[4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
would result in an approximate grouping of
[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]
In the grouping above, you could argue that 4.52
could belong to the first or second group. If visualised as I did in the example above, the groupings would be represented by the green lines. My lists are actually several hundred to several thousand values in length.
Upvotes: 2
Views: 927
Reputation: 11
https://en.wikipedia.org/wiki/K-means_clustering k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
Upvotes: 1
Reputation: 180411
A numpy version:
l = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
import numpy as np
x = np.array(l)
mask = np.diff(np.round(x))
print(np.split(x, np.where(mask)[0] + 1))
[array([ 4.1 , 4.05, 4.14, 4.01, 3.97]), array([ 4.52, 4.97, 5.02, 5.05, 5.2 , 5.18]), array([ 3.66, 3.77, 3.59, 3.72])]
Or:
import numpy as np
diff = .5
x = np.array(l)
mask = np.abs(x[:-1] - x[1:]) <= diff
print(np.split(x, np.where(~mask)[0] + 1)
[array([ 4.1 , 4.05, 4.14, 4.01, 3.97]), array([ 4.52, 4.97, 5.02, 5.05, 5.2 , 5.18]), array([ 3.66, 3.77, 3.59, 3.72])]
Upvotes: 2
Reputation: 9533
from statistics import mean
def ordered_cluster(data, max_diff):
current_group = ()
for item in data:
test_group = current_group + (item, )
test_group_mean = mean(test_group)
if all((abs(test_group_mean - test_item) < max_diff for test_item in test_group)):
current_group = test_group
else:
yield current_group
current_group = (item, )
if current_group:
yield current_group
data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
print(list(ordered_cluster(data, 0.5)))
Output :
[(4.1, 4.05, 4.14, 4.01, 3.97, 4.52), (4.97, 5.02, 5.05, 5.2, 5.18), (3.66, 3.77, 3.59, 3.72)]
This ensures that each item from a group does not exceed max_diff
to the mean of the group. If it does, a new group is started.
Upvotes: 3
Reputation: 107287
You can use itertools.groupby
to categorize your data based on a specific difference (2
in this case) with those preceding item.
from itertools import groupby, chain
from collections import OrderedDict
def grouper(_lst, interval):
z = zip(_lst,_lst[1:])
return [OrderedDict.fromkeys(chain.from_iterable(g)).keys() for k,g in groupby(z,key=lambda x:x[1]-x[0]<interval) if k]
Here I used OrderedDict.fromkeys
in order to preserver the unique items in a specific order.
Demo :
test = [0, 1.3, 2.2, 2.9, 6, 7.8, 8, 9.1, 10.4,15, 16, 17.6, 17.7, 18.9]
print(grouper(test, 2))
[[0, 1.3, 2.2, 2.9], [6, 7.8, 8, 9.1, 10.4], [15, 16, 17.6, 17.7, 18.9]]
Upvotes: 2
Reputation: 33671
You may use itertools.groupby
- it combines consecutive elements with same result of given key function (round
in this case):
In [7]: import itertools
In [8]: data = [4.1, 4.05, 4.14, 4.01, 3.97, 4.52, 4.97, 5.02, 5.05, 5.2, 5.18, 3.66, 3.77, 3.59, 3.72]
In [9]: [tuple(xs) for _, xs in itertools.groupby(data, round)]
Out[9]:
[(4.1, 4.05, 4.14, 4.01, 3.97),
(4.52, 4.97, 5.02, 5.05, 5.2, 5.18),
(3.66, 3.77, 3.59, 3.72)]
Upvotes: 4