codingknob

Reputation: 11680

k-means / x-means (or other?) clustering in pandas/python

I have a dataframe that can be reconstructed from the dict below.

The dataframe represents 23 statistics (X1-X23) for various cities around the world. Each city occupies a single row in the dataframe with the 23 statistics as separate columns.

My actual df has ~6 million cities, so it's a large dataframe.

What I want to do is:

Step#1: Identify clusters of cities based on the 23 statistics (X1-X23).

Step#2: Given the identified clusters in Step#1, I want to construct a portfolio of cities such that:

a) number of cities selected from any given cluster is limited (limit may be different for each cluster)

b) avoid certain clusters altogether

c) apply additional criteria to the portfolio selection such that the correlation of poor weather between cities in the portfolio is minimized and correlation of good weather between cities is maximized.

Given the size of my problem set, the K for a K-means algorithm would likely be quite large, but I'm not sure what that value should be.

I've been reading the following on clustering:

Cluster analysis in R: determine the optimal number of clusters

How do I determine k when using k-means clustering?

X-means: Extending K-means...

However, a lot of the literature is foreign to me and would take me months to understand. I'm not a data scientist and don't have the time to take a course on machine learning.

At this point I have the dataframe and am now twiddling my thumbs.

I'd be grateful if you could help me move forward in actually implementing Steps #1 and #2 in pandas with an example dataset.

The dict below can be reconstructed into a dataframe with pd.DataFrame(x), where x is the dict:

Output of df.head().to_dict('rec'):

[{'X1': 123.40000000000001,
  'X2': -67.900000000000006,
  'X3': 172.0,
  'X4': -2507.1999999999998,
  'X5': 80.0,
  'X6': 1692.0999999999999,
  'X7': 13.5,
  'X8': 136.30000000000001,
  'X9': -187.09999999999999,
  'X10': 50.0,
  'X11': -822.0,
  'X12': 13.0,
  'X13': 260.80000000000001,
  'X14': 14084.0,
  'X15': -944.89999999999998,
  'X16': 224.59999999999999,
  'X17': -23.100000000000001,
  'X18': -16.199999999999999,
  'X19': 1825.9000000000001,
  'X20': 710.70000000000005,
  'X21': -16.199999999999999,
  'X22': 1825.9000000000001,
  'X23': 66.0,
  'city': 'SFO'},
 {'X1': -359.69999999999999,
  'X2': -84.299999999999997,
  'X3': 86.0,
  'X4': -1894.4000000000001,
  'X5': 166.0,
  'X6': 882.39999999999998,
  'X7': -19.0,
  'X8': -133.30000000000001,
  'X9': -84.799999999999997,
  'X10': 27.0,
  'X11': -587.29999999999995,
  'X12': 36.0,
  'X13': 332.89999999999998,
  'X14': 825.20000000000005,
  'X15': -3182.5,
  'X16': -210.80000000000001,
  'X17': 87.400000000000006,
  'X18': -443.69999999999999,
  'X19': -3182.5,
  'X20': 51.899999999999999,
  'X21': -443.69999999999999,
  'X22': -722.89999999999998,
  'X23': -3182.5,
  'city': 'YYZ'},
 {'X1': -24.800000000000001,
  'X2': -34.299999999999997,
  'X3': 166.0,
  'X4': -2352.6999999999998,
  'X5': 87.0,
  'X6': 1941.3,
  'X7': 56.600000000000001,
  'X8': 120.2,
  'X9': -65.400000000000006,
  'X10': 44.0,
  'X11': -610.89999999999998,
  'X12': 19.0,
  'X13': 414.80000000000001,
  'X14': 4891.1999999999998,
  'X15': -2396.0999999999999,
  'X16': 181.59999999999999,
  'X17': 177.0,
  'X18': -92.900000000000006,
  'X19': -2396.0999999999999,
  'X20': 805.60000000000002,
  'X21': -92.900000000000006,
  'X22': -379.69999999999999,
  'X23': -2396.0999999999999,
  'city': 'DFW'},
 {'X1': -21.300000000000001,
  'X2': -47.399999999999999,
  'X3': 166.0,
  'X4': -2405.5999999999999,
  'X5': 85.0,
  'X6': 1836.8,
  'X7': 55.700000000000003,
  'X8': 130.80000000000001,
  'X9': -131.09999999999999,
  'X10': 47.0,
  'X11': -690.60000000000002,
  'X12': 16.0,
  'X13': 297.30000000000001,
  'X14': 5163.3999999999996,
  'X15': -2446.4000000000001,
  'X16': 182.30000000000001,
  'X17': 83.599999999999994,
  'X18': -36.0,
  'X19': -2446.4000000000001,
  'X20': 771.29999999999995,
  'X21': -36.0,
  'X22': -378.30000000000001,
  'X23': -2446.4000000000001,
  'city': 'PDX'},
 {'X1': -22.399999999999999,
  'X2': -9.0,
  'X3': 167.0,
  'X4': -2405.5999999999999,
  'X5': 86.0,
  'X6': 2297.9000000000001,
  'X7': 41.0,
  'X8': 109.7,
  'X9': 64.900000000000006,
  'X10': 42.0,
  'X11': -558.29999999999995,
  'X12': 21.0,
  'X13': 753.10000000000002,
  'X14': 5979.6999999999998,
  'X15': -2370.1999999999998,
  'X16': 187.40000000000001,
  'X17': 373.10000000000002,
  'X18': -224.30000000000001,
  'X19': -2370.1999999999998,
  'X20': 759.5,
  'X21': -224.30000000000001,
  'X22': -384.39999999999998,
  'X23': -2370.1999999999998,
  'city': 'EWR'}]
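For reference, a list of records like the one above can be loaded back into a DataFrame directly; a minimal sketch (the records here are truncated to two columns for brevity):

```python
import pandas as pd

# Output of df.head().to_dict('rec') is a list of dicts; pandas maps
# each dict to a row and each key to a column.
records = [
    {'X1': 123.4, 'X2': -67.9, 'city': 'SFO'},
    {'X1': -359.7, 'X2': -84.3, 'city': 'YYZ'},
]
df = pd.DataFrame(records).set_index('city')
print(df)
```

Setting `city` as the index keeps the X1-X23 columns purely numeric, which is convenient for the clustering steps below.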

Upvotes: 1

Views: 5370

Answers (1)

ivan7707

Reputation: 1156

I don't know what you mean by "for further processing", but here is a super simple outline to get you started.

1) Get the data into a pandas dataframe with the variables (X1-X23) as column headers and each row representing a different city (so that your df.head() shows X1-X23 across the top).

2) Standardize the variables.
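Standardization can be done with scikit-learn's `StandardScaler`; a sketch (the toy values stand in for the real X1-X23 columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real X1-X23 columns (values made up).
X = np.array([[123.4, -67.9],
              [-359.7, -84.3],
              [-24.8, -34.3]])

# Rescale each column to zero mean and unit variance so variables on
# large scales (e.g. X14 in the thousands) don't dominate the
# Euclidean distances that k-means uses.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```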

3) Decide whether to use PCA before using k-means.
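If you do opt for PCA, a minimal sketch with scikit-learn (applied after standardizing; random data stands in for the scaled X1-X23, and the 95% variance threshold is an arbitrary choice, not a recommendation):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_scaled = rng.standard_normal((100, 23))  # stand-in for scaled X1-X23

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```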

4) Use k-means; scikit-learn makes this part easy (see the KMeans documentation and its examples).
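The k-means step itself is a few lines; a sketch (random data stands in for the scaled features, and k=3 is a placeholder, not a suggestion for 6 million cities):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_scaled = rng.standard_normal((300, 23))  # stand-in for scaled data

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

# labels[i] is the cluster assignment of row i; attach it back with
# df['cluster'] = labels to drive the portfolio-selection step.
print(np.bincount(labels))
```

For ~6 million rows, `sklearn.cluster.MiniBatchKMeans` is a drop-in replacement that trades a little accuracy for much better scaling.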

5) Try silhouette analysis for choosing the number of clusters as a starting point.
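Silhouette scores can be compared across candidate values of k; a sketch using synthetic blobs (the range 2-6 is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure, for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Higher silhouette score (in [-1, 1]) means tighter, better-separated
# clusters; pick the k that maximizes it.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On a 6-million-row dataset, compute the silhouette on a sample (silhouette_score accepts a `sample_size` argument) since it is quadratic in the number of points.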

Good references:
Hastie and Tibshirani book
Hastie and Tibshirani book

Hastie and Tibshirani's free course (it uses R)

Udacity, Coursera, EDX courses on machine learning

EDIT: forgot to mention, don't use your whole dataset while you are testing out the process. Use a much smaller portion of the data (e.g. 100K cities) so that processing time stays low until you get everything right.
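Subsampling for prototyping is one line in pandas; a sketch (random data stands in for the real dataframe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10_000, 3)),
                  columns=['X1', 'X2', 'X3'])

# Tune the whole pipeline (standardize -> cluster -> evaluate) on a
# random sample, then rerun on the full data once it works end to end.
sample = df.sample(n=1_000, random_state=0)
print(len(sample))
```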

Upvotes: 2
