codingknob

Reputation: 11680

k-means / x-means (or other?) clustering in pandas/python

I have a dataframe that can be reconstructed from the dict below.

The dataframe represents 23 statistics (X1-X23) for various cities around the world. Each city occupies a single row in the dataframe with the 23 statistics as separate columns.

My actual df has ~6 million cities, so it's a large dataframe.

What I want to do is:

Step#1: Identify clusters of cities based on the 23 statistics (X1-X23).

Step#2: Given the identified clusters in Step#1, I want to construct a portfolio of cities such that:

a) number of cities selected from any given cluster is limited (limit may be different for each cluster)

b) avoid certain clusters altogether

c) apply additional criteria to the portfolio selection such that the correlation of poor weather between cities in the portfolio is minimized and correlation of good weather between cities is maximized.

Given the size of my problem set, the K for a K-means algorithm would likely be quite large, but I'm not sure what that value should be.

I've been reading the following on clustering:

Cluster analysis in R: determine the optimal number of clusters

How do I determine k when using k-means clustering?

X-means: Extending K-means...

However, a lot of the literature is foreign to me and would take me months to understand. I'm not a data scientist and don't have the time to take a course on machine learning.

At this point I have the dataframe and am now twiddling my thumbs.

I'd be grateful if you could help me move forward in actually implementing Steps #1 and #2 in pandas with an example dataset.

The dict below can be reconstructed into a dataframe with pd.DataFrame(x), where x is the dict:

Output of df.head().to_dict('rec'):

[{'X1': 123.40000000000001,
  'X2': -67.900000000000006,
  'X3': 172.0,
  'X4': -2507.1999999999998,
  'X5': 80.0,
  'X6': 1692.0999999999999,
  'X7': 13.5,
  'X8': 136.30000000000001,
  'X9': -187.09999999999999,
  'X10': 50.0,
  'X11': -822.0,
  'X12': 13.0,
  'X13': 260.80000000000001,
  'X14': 14084.0,
  'X15': -944.89999999999998,
  'X16': 224.59999999999999,
  'X17': -23.100000000000001,
  'X18': -16.199999999999999,
  'X19': 1825.9000000000001,
  'X20': 710.70000000000005,
  'X21': -16.199999999999999,
  'X22': 1825.9000000000001,
  'X23': 66.0,
  'city': 'SFO'},
 {'X1': -359.69999999999999,
  'X2': -84.299999999999997,
  'X3': 86.0,
  'X4': -1894.4000000000001,
  'X5': 166.0,
  'X6': 882.39999999999998,
  'X7': -19.0,
  'X8': -133.30000000000001,
  'X9': -84.799999999999997,
  'X10': 27.0,
  'X11': -587.29999999999995,
  'X12': 36.0,
  'X13': 332.89999999999998,
  'X14': 825.20000000000005,
  'X15': -3182.5,
  'X16': -210.80000000000001,
  'X17': 87.400000000000006,
  'X18': -443.69999999999999,
  'X19': -3182.5,
  'X20': 51.899999999999999,
  'X21': -443.69999999999999,
  'X22': -722.89999999999998,
  'X23': -3182.5,
  'city': 'YYZ'},
 {'X1': -24.800000000000001,
  'X2': -34.299999999999997,
  'X3': 166.0,
  'X4': -2352.6999999999998,
  'X5': 87.0,
  'X6': 1941.3,
  'X7': 56.600000000000001,
  'X8': 120.2,
  'X9': -65.400000000000006,
  'X10': 44.0,
  'X11': -610.89999999999998,
  'X12': 19.0,
  'X13': 414.80000000000001,
  'X14': 4891.1999999999998,
  'X15': -2396.0999999999999,
  'X16': 181.59999999999999,
  'X17': 177.0,
  'X18': -92.900000000000006,
  'X19': -2396.0999999999999,
  'X20': 805.60000000000002,
  'X21': -92.900000000000006,
  'X22': -379.69999999999999,
  'X23': -2396.0999999999999,
  'city': 'DFW'},
 {'X1': -21.300000000000001,
  'X2': -47.399999999999999,
  'X3': 166.0,
  'X4': -2405.5999999999999,
  'X5': 85.0,
  'X6': 1836.8,
  'X7': 55.700000000000003,
  'X8': 130.80000000000001,
  'X9': -131.09999999999999,
  'X10': 47.0,
  'X11': -690.60000000000002,
  'X12': 16.0,
  'X13': 297.30000000000001,
  'X14': 5163.3999999999996,
  'X15': -2446.4000000000001,
  'X16': 182.30000000000001,
  'X17': 83.599999999999994,
  'X18': -36.0,
  'X19': -2446.4000000000001,
  'X20': 771.29999999999995,
  'X21': -36.0,
  'X22': -378.30000000000001,
  'X23': -2446.4000000000001,
  'city': 'PDX'},
 {'X1': -22.399999999999999,
  'X2': -9.0,
  'X3': 167.0,
  'X4': -2405.5999999999999,
  'X5': 86.0,
  'X6': 2297.9000000000001,
  'X7': 41.0,
  'X8': 109.7,
  'X9': 64.900000000000006,
  'X10': 42.0,
  'X11': -558.29999999999995,
  'X12': 21.0,
  'X13': 753.10000000000002,
  'X14': 5979.6999999999998,
  'X15': -2370.1999999999998,
  'X16': 187.40000000000001,
  'X17': 373.10000000000002,
  'X18': -224.30000000000001,
  'X19': -2370.1999999999998,
  'X20': 759.5,
  'X21': -224.30000000000001,
  'X22': -384.39999999999998,
  'X23': -2370.1999999999998,
  'city': 'EWR'}]
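For reference, a list of records like the one above can be loaded back into a DataFrame directly; a minimal sketch (the records here are truncated to two columns for brevity):

```python
import pandas as pd

# Output of df.head().to_dict('rec') is a list of dicts; pandas maps
# each dict to a row and each key to a column.
records = [
    {'X1': 123.4, 'X2': -67.9, 'city': 'SFO'},
    {'X1': -359.7, 'X2': -84.3, 'city': 'YYZ'},
]
df = pd.DataFrame(records).set_index('city')
print(df)
```

Setting `city` as the index keeps the X1-X23 columns purely numeric, which is convenient for the clustering steps below.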

Upvotes: 1

Views: 5370

Answers (1)

ivan7707

Reputation: 1156

I don't know what you mean by "for further processing", but here is a super simple outline to get you started.

1) Get the data into a pandas dataframe with the variables (X1-X23) as column headers and each row representing a different city (so that your df.head() shows X1-X23 across the top).

2) Standardize the variables.
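Standardization can be done with scikit-learn's `StandardScaler`; a sketch (the toy values stand in for the real X1-X23 columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real X1-X23 columns (values made up).
X = np.array([[123.4, -67.9],
              [-359.7, -84.3],
              [-24.8, -34.3]])

# Rescale each column to zero mean and unit variance so variables on
# large scales (e.g. X14 in the thousands) don't dominate the
# Euclidean distances that k-means uses.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```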

3) Decide whether to use PCA before using k-means.
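If you do opt for PCA, a minimal sketch with scikit-learn (applied after standardizing; random data stands in for the scaled X1-X23, and the 95% variance threshold is an arbitrary choice, not a recommendation):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_scaled = rng.standard_normal((100, 23))  # stand-in for scaled X1-X23

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```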

4) Use k-means; scikit-learn makes this part easy (see the KMeans documentation and its examples).
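The k-means step itself is a few lines; a sketch (random data stands in for the scaled features, and k=3 is a placeholder, not a suggestion for 6 million cities):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_scaled = rng.standard_normal((300, 23))  # stand-in for scaled data

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

# labels[i] is the cluster assignment of row i; attach it back with
# df['cluster'] = labels to drive the portfolio-selection step.
print(np.bincount(labels))
```

For ~6 million rows, `sklearn.cluster.MiniBatchKMeans` is a drop-in replacement that trades a little accuracy for much better scaling.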

5) Try silhouette analysis for choosing the number of clusters as a starting point.
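Silhouette scores can be compared across candidate values of k; a sketch using synthetic blobs (the range 2-6 is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure, for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Higher silhouette score (in [-1, 1]) means tighter, better-separated
# clusters; pick the k that maximizes it.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On a 6-million-row dataset, compute the silhouette on a sample (silhouette_score accepts a `sample_size` argument) since it is quadratic in the number of points.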

Good references:
Hastie and Tibshirani book
Hastie and Tibshirani book

Hastie and Tibshirani's free course (it uses R)

Udacity, Coursera, EDX courses on machine learning

EDIT: forgot to mention, don't use your whole dataset while you are testing out the process. Use a much smaller portion of the data (e.g. 100K cities) so that processing time stays low until you get everything right.
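Subsampling for prototyping is one line in pandas; a sketch (random data stands in for the real dataframe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10_000, 3)),
                  columns=['X1', 'X2', 'X3'])

# Tune the whole pipeline (standardize -> cluster -> evaluate) on a
# random sample, then rerun on the full data once it works end to end.
sample = df.sample(n=1_000, random_state=0)
print(len(sample))
```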

Upvotes: 2
