bolkhovsky

Reputation: 120

How to find clusters of values in numpy array

I have an array (M x N) of air pressure data (gridded model data). There are also two arrays (also M x N) for latitudes and longitudes. To build a GeoJSON of isobars (surfaces of equal pressure) I need to find clusters of pressure values with a given step (1 Pa, 0.5 Pa). In general I was thinking of solving it like this:

  1. Build a list of objects: [{ lat, lon, pressure },..] to keep lat and lon data linked to a pressure;
  2. Sort objects by pressure;
  3. For each object in the list: compare its pressure value and move it to a dedicated list;
  4. Create GeoJSON features.

But step 3 is not yet clear to me: how do I find clusters in a smart way? Which algorithm should I look for? Can I do that with the scipy.cluster package?

Upvotes: 2

Views: 5677

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77505

I don't think you are looking for clustering at all.

Apparently the isobar ranges are given. So split your data set on them; you do not need to sort for this - just find the minimum and maximum to get all buckets, then select data according to each bucket separately. This breaks the problem down nicely into smaller chunks.
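A minimal sketch of that bucketing with plain numpy, assuming `pressure`, `lat` and `lon` are arrays of the same shape and `step` is the bucket width in the same unit as the pressure. The function name and the return layout are made up for illustration:

```python
import numpy as np

def split_into_buckets(pressure, lat, lon, step):
    """Return a list of (lower_edge, lat_values, lon_values), one per non-empty bucket.

    Hypothetical helper: the name and return layout are illustrative only.
    """
    # Lowest bucket edge, aligned to a multiple of `step` at or below the minimum.
    lowest = np.floor(pressure.min() / step) * step
    # Integer bucket index for every grid cell; no sorting is needed.
    idx = np.floor((pressure - lowest) / step).astype(int)

    buckets = []
    for k in range(idx.max() + 1):
        mask = idx == k
        if mask.any():  # skip buckets that contain no grid cells
            buckets.append((lowest + k * step, lat[mask], lon[mask]))
    return buckets
```

Each entry can then be processed on its own, e.g. `for edge, la, lo in split_into_buckets(p, lat, lon, 0.5): ...`.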

I guess your problem is largely a visualization one. You want to display areas of similar pressure instead of points, right?

Instead of looking at statistical methods such as least-squares optimization (k-means), which require you to predefine the parameter k, consider looking at visualization techniques such as Alpha Shapes (closely related to convex hulls, but they also allow non-convex shapes). If you compute alpha shapes for each of your pressure domains, you should get a nice visualization of these regions.
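One common way to approximate an alpha shape is to take the Delaunay triangulation, drop triangles whose circumradius is too large, and keep the edges that belong to exactly one remaining triangle. The sketch below does that with `scipy.spatial.Delaunay`. It treats the coordinates as planar (a simplification for lon/lat), and the function name and the `max_radius` parameter are assumptions for illustration:

```python
from collections import Counter

import numpy as np
from scipy.spatial import Delaunay

def alpha_shape_edges(points, max_radius):
    """Return boundary edges (pairs of point indices) of a rough alpha shape.

    Hypothetical helper; `points` is an (n, 2) array treated as planar.
    """
    tri = Delaunay(points)
    edge_count = Counter()
    for ia, ib, ic in tri.simplices:
        pa, pb, pc = points[ia], points[ib], points[ic]
        # Side lengths and area of the triangle.
        a = np.linalg.norm(pb - pc)
        b = np.linalg.norm(pa - pc)
        c = np.linalg.norm(pa - pb)
        area = 0.5 * abs((pb[0] - pa[0]) * (pc[1] - pa[1])
                         - (pb[1] - pa[1]) * (pc[0] - pa[0]))
        if area == 0.0:
            continue  # degenerate triangle
        # Circumradius R = abc / (4 * area); keep only "small" triangles.
        if a * b * c / (4.0 * area) <= max_radius:
            for i, j in ((ia, ib), (ib, ic), (ic, ia)):
                edge_count[tuple(sorted((int(i), int(j))))] += 1
    # Edges used by exactly one kept triangle lie on the outline.
    return [edge for edge, n in edge_count.items() if n == 1]
```

A smaller `max_radius` gives a tighter, more concave outline; a very large one degenerates to the convex hull.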

If you insist on using clustering, have a look at DBSCAN, mostly because it allows non-convex clusters and can work with latitude+longitude (k-means doesn't). But even HAC (hierarchical agglomerative clustering) may give you good results, since you can define your cut threshold based on your data resolution (e.g. merge any points - in the same pressure bucket - if they are less than 1km apart).
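If you go the DBSCAN route, a sketch with scikit-learn could look like the following. It assumes scikit-learn is available and that you run it on the points of one pressure bucket at a time. The haversine metric expects [lat, lon] in radians, so the distance threshold is converted from km using a mean Earth radius of 6371 km. The function name and default values are illustrative only:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def cluster_bucket(lat_deg, lon_deg, eps_km=1.0, min_samples=2):
    """Label the points of one pressure bucket; -1 marks noise.

    Hypothetical helper; inputs are 1-D arrays in degrees.
    """
    # Haversine distance in scikit-learn works on [lat, lon] in radians.
    coords = np.radians(np.column_stack((lat_deg, lon_deg)))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # km converted to radians on the unit sphere
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords)
```

`eps_km` should be chosen from the grid resolution: slightly larger than the spacing between neighbouring grid points, so that adjacent cells in the same bucket are connected.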

Upvotes: 1
