tmn103

Reputation: 339

Using k-means clustering to cluster based on single variable

I’m just trying to get my head around clustering.

I have a series of data points, y, each of which has Gaussian noise associated with it.

There are two classes of values, 0 and >0 (obviously with noise). I'm trying to find the centre point of the group that is >0.

I’ve plotted the points with a simple moving average to be able to eyeball the data.

[Moving average plot of the data]

How can I cluster the data just based on the y value?

I'd like to have two clusters: one covering the points on the left and right (roughly x < 120 and x > 260 by the looks of it) and another for the middle points (x = 120 to 260).

If I try with two clusters I get this:

[k-means plot with k=2]

How should I amend my code to achieve this?

import numpy as np
import matplotlib.pyplot as plt

x = range(315)
y= [-0.0019438692324050865, 0.0028994208839327852, 0.0051483573976274649, -0.0033242993359676809, -0.007205517954705391, 0.0023493638544448323, 0.0021109981155292179, 0.0035990200904119076, -0.0039516797159245328, 0.0046512034107712786, -0.0019248189368846083, 0.0036744109953683823, 0.0007898612768152954, 0.0050059088808496474, -0.0021084425769681558, 0.0014692258570182986, -0.0030711206115484175, -0.0026614801222815628, 0.0022816301256991535, 0.00019923934682088178, -0.0013181161659271139, -0.0021956355547661358, 0.0012941895041076283, 0.00337197586896105, -0.0019792508536746402, -0.002020497762984554, 0.0014495021773240431, 0.0011887337096206894, 0.0016667792145975404, -0.0010119590445208419, -0.0024506337087077676, 0.0072264471843846339, -0.0014126073097276062, -0.00065673498034648755, -0.0011355352304356647, -0.00042657980930307281, -0.0032875547481258042, -0.002351265010099495, -0.00073344218847348742, -0.0031555991687002589, 0.0026170287799315104, 0.0019289080666337198, -0.0021804765064623076, 0.0026221290350876979, 0.0019831827145683828, -0.005422907223254632, -0.0014107046201467732, -0.0049438583709020423, 0.00081884635937855494, 0.0054783747880986361, -0.0011282600170147909, -0.00436581779762948, 0.0024421851848953177, -0.0018564229613786095, -0.0052492274840120123, 0.0051775747035086306, 0.0052413417491534494, 0.0030817295096650732, -0.0014106391941506153, 0.00074380887788818206, -0.0041507550699856439, -0.00074928547462217287, -9.3938667619130614e-05, -0.00060592968804004362, 0.0064913597798387348, 0.0018098075166183621, 0.00099550852535854441, 0.0037322288350247917, 0.0027039351321340869, 0.0060238021513650541, -0.006567405116575234, 0.0020858553839503175, -0.0040329574871009084, -0.0029337227854833213, 0.0020743996957790969, 0.0041249738085716511, -0.0016678673351373336, -0.00081387164524554967, -0.0028411340446090278, 0.00013572776045231967, -0.00025350369023925548, 0.00071609777542998309, -0.0018427036825796074, -0.0015513575887011904, -0.0016357115978466398, 0.0038235991426514866, 0.0017693050063256977, -0.00029816429542494152, -0.0016071303644783605, -0.0031883070092131086, -0.0010340123778528594, -0.0049194467790889653, 0.0012109237666701397, 0.0024532524488299246, 0.0069307209537693721, 0.0009573350812806618, -6.0022322637651027e-05, -0.00050143013334696311, 0.0023415017810229548, 0.0033053845403900849, -0.0061156769150035222, 0.00022216114877491691, 0.0017257349557975464, 4.6919738262423826e-05, -0.0035257466102171162, -0.0043673831041441185, -0.0016592116617178102, -0.003298933045964781, -0.001667158964114637, 0.0011283739877531254, -0.0055098513985193534, 0.0023564462221116358, 0.0041971132878626258, 0.0061727231077443314, 0.0047583822927202779, 0.0022475414486232245, 0.0048682822792560521, 0.0022415648209199016, 0.00044859963858686957, -0.0018519391698513449, 0.0031460918774998763, 0.0038614233082916809, -0.0043409564348247066, -0.0055560805453666326, -0.00025133196059449212, 0.012436346397552794, 0.01136022093203152, 0.011244278807602391, 0.01470018209739289, 0.0075560289478025277, 0.012568781764361209, 0.0076068752709663838, 0.011022209533236597, 0.010545997929846045, 0.01084340614623565, 0.011728388118710915, 0.0075043238708055885, 0.012860298948366296, 0.0097297636410632864, 0.0098800557729756874, 0.011536517297700085, 0.0082316420968713416, 0.012612386004592427, 0.016617154743589352, 0.0091391582296167315, 0.014952150276251052, 0.011675391002362373, 0.01568297072839233, 0.01537664322062633, 0.01622711654371662, 0.010708828344561546, 0.016625354383482532, 
0.010757807468539406, 0.016867909081979202, 0.010354635736138377, 0.014345365677006765, 0.011114328315579219, 0.010034249196973242, 0.015846180181371881, 0.014303841146954242, 0.011608682896746103, 0.0086826955459553216, 0.0088576104599897426, 0.011250553207393772, 0.005522552439745569, 0.011185993425936373, 0.010241377537878162, 0.0079206732150164348, 0.0052965651546758108, 0.011104715912291204, 0.010506408714857187, 0.010153282642128673, 0.010286986015082572, 0.01187330766677645, 0.014541420264499783, 0.013092204890199896, 0.012979246400649271, 0.012595814351669916, 0.014714607377710237, 0.011727516021525658, 0.011035077266739704, 0.0089698030032708698, 0.0087245475140550147, 0.011139467365240661, 0.0094505568595650603, 0.014430361388952871, 0.0089241578716030695, 0.014616210804585136, 0.013295072783119581, 0.014430633057603408, 0.01200577022494694, 0.011315388654675421, 0.013359877656434442, 0.017704146495248471, 0.0089900858719559155, 0.014731590728415532, 0.0053244009632545759, 0.011199377929150522, 0.0098899254166580439, 0.012220397221188688, 0.015315682643295272, 0.0042842773538990919, 0.0098560854848898077, 0.0088592602102698509, 0.011682575531316278, 0.0098450268165344631, 0.015508017179782136, 0.0083959771972897564, 0.0057504382506886418, 0.010149849298310511, 0.011467172305959087, 0.019354427705224483, 0.013200207481702888, 0.0084555200083286791, 0.011458643458455485, 0.0067582116806278788, 0.01083616691886825, 0.013189184991857963, 0.011774794518724967, 0.014419252448288828, 0.011252283438046358, 0.013346699363583018, 0.0070752340082163006, 0.013215300343131422, 0.0083841320189162287, 0.0067600805611729283, 0.014043517055899181, 0.0098241497159076551, 0.011466675085574904, 0.01155354571355972, 0.012051701509217881, 0.010150596813866767, 0.0093930906430917619, 0.003368481869910186, 0.0048359029438027378, 0.0072083852964288445, 0.010112266453748613, 0.014009345326404186, 0.0050187514558796657, 0.0076315122645601551, 0.0098572381625301152, 0.0114902035403828, 0.018390212262653569, 0.020552166087412803, 0.010428735773226807, 0.011717974670325962, 0.011586303572796604, 0.0092978832913345726, 0.0040060048273946845, 0.012302496528511328, 0.0076707934776137684, 0.014700766223305586, 0.013491092168119941, 0.016244916923257174, 0.010387716692694397, 0.0072564046806323553, 0.0089420045528720883, 0.012125390630607462, 0.013274623392811291, 0.012783388635585766, 0.013859113028817658, 0.0080975189401925642, 0.01379241865445455, 0.012648552766643405, 0.011380280655911323, 0.010109646424218717, 0.0098577688652478051, 0.0064661895943772208, 0.010848835432253455, -0.0010986941731458047, -0.00052875821639583262, 0.0020423603076171414, 0.0035710440970171805, 0.001652886517437206, 0.0023512717524485573, -0.002695275440737862, 0.002253880812688683, -0.0080855104018828141, -0.0020090808966136161, -0.0029794078852333791, 0.00047537441103425869, -0.0010168825525621432, 0.0028683012479151873, -0.0014733214239664142, 0.0019432702158397569, -0.0012411849653504801, -0.00034507088510895141, -0.0023587874349834145, 0.0018156591123708393, 0.0040923006067568324, 0.0043522232127477072, -0.0055992642684123371, -0.0019368557792245147, 0.0026257395447205848, 0.0025594329536029635, 0.00053681548609292378, 0.0032186216144045742, -0.003338121135450386, 0.00065996843114729585, 0.006711173245189642, 0.0032877327776177517, 0.0039528629317296367, 0.0063732674764248719, -0.0026207617244284023, 0.0061381482567009048, -0.003024741769256066, -0.0023891419421980839, -0.004011235930513047, 0.0018372067754070733, 
-0.0045928077859572689, -0.0021420171112169601, 0.001665179522797816, 0.0074356736689407859, 0.0065680163280897891, -0.0038116640825467678]

from sklearn.cluster import KMeans

data = np.column_stack([x, y])

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
plt.scatter(data[:, 0], data[:, 1], c=y_kmeans, s=5, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.grid()

I’d also like to be able to return the max, min and average for the values in each cluster - is this possible?

Upvotes: 4

Views: 13642

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77505

You can reshape your data to an n x 1 array.

But if you want to take the time into account, I suggest you look into change detection in time series instead. It can detect a change in mean.
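For example, clustering on y alone might look something like this (a minimal sketch, assuming the y list from the question):

import numpy as np
from sklearn.cluster import KMeans

y_arr = np.array(y).reshape(-1, 1)        # n x 1 array: cluster on y only, ignoring x
kmeans = KMeans(n_clusters=2).fit(y_arr)
labels = kmeans.labels_                    # 0/1 cluster assignment for each point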

Upvotes: 1

CodeZero

Reputation: 1699

Some ideas on your problem.

k-means is actually a multivariate method, so it is probably not a good choice in your case. You can take advantage of the one-dimensionality of your data by looking for minima of a kernel density estimate of the y-data. A plot of the density estimate will show a bimodal density function, with the two modes divided by a minimum; that minimum is the y-value at which you want to split the two clusters. Have a look at http://scikit-learn.org/stable/modules/density.html#kernel-density
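A rough sketch of that idea with scikit-learn's KernelDensity (the bandwidth and the search window between the two modes are guesses you would need to tune for your data):

import numpy as np
from sklearn.neighbors import KernelDensity

y_arr = np.array(y).reshape(-1, 1)

# Fit a 1-D Gaussian kernel density estimate to the y-values.
kde = KernelDensity(kernel='gaussian', bandwidth=0.002).fit(y_arr)

# Evaluate the density on a fine grid spanning the y-range.
grid = np.linspace(y_arr.min(), y_arr.max(), 500).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# Look for the dip between the two modes (window chosen by eye from your plot).
window = (grid[:, 0] > 0.002) & (grid[:, 0] < 0.008)
split_y = grid[window][np.argmin(density[window]), 0]

labels = (np.array(y) > split_y).astype(int)   # 0 = low group, 1 = high group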

To get the x-values at which you divide, you could use the moving average you already computed.

However, there might be methods better suited to your kind of data. You might want to ask your question at https://stats.stackexchange.com/ as it is not really a programming problem but one about the appropriate method.

Upvotes: 1

Stev

Reputation: 1150

Using your code, the simplest way to get what you want is to change:

kmeans.fit(data)
y_kmeans = kmeans.predict(data)

to

kmeans.fit(data[:,1].reshape(-1,1))
y_kmeans = kmeans.predict(data[:,1].reshape(-1,1))

You can get the max, min, mean, etc. by indexing with the cluster labels, for example:

np.max(data[:,1][y_kmeans == 1])
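Or, looping over the labels to get all three statistics for each cluster (assuming the data and y_kmeans arrays from your code):

for label in np.unique(y_kmeans):
    vals = data[:, 1][y_kmeans == label]
    print(label, vals.min(), vals.max(), vals.mean())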

Upvotes: 1
