Reputation: 11680
I have a dataframe that can be reconstructed from the dict below.
The dataframe represents 23 statistics (X1-X23)
for various cities around the world. Each city occupies a single row in the dataframe with the 23 statistics as separate columns.
My actual df has ~6 million cities, so it's a large dataframe.
What I want to do is:
Step#1: Identify clusters of cities based on the 23 statistics (X1-X23).
Step#2: Given the identified clusters in Step#1, I want to construct a portfolio of cities such that:
a) number of cities selected from any given cluster is limited (limit may be different for each cluster)
b) avoid certain clusters altogether
c) apply additional criteria to the portfolio selection such that the correlation of poor weather between cities in the portfolio is minimized and the correlation of good weather between cities is maximized.
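To make Step#2 (a) and (b) concrete, here is the kind of selection logic I have in mind, as a sketch only: it assumes a hypothetical `cluster` column already exists (i.e. the output of Step#1), and the per-cluster limits and excluded clusters are made-up values.

```python
import pandas as pd

# Made-up example: cities with cluster labels already assigned by Step#1
df = pd.DataFrame({
    'city': ['SFO', 'YYZ', 'DFW', 'PDX', 'EWR'],
    'cluster': [0, 1, 0, 2, 1],
})

limits = {0: 1, 1: 2}   # max number of cities to take from each cluster
excluded = {2}          # clusters to avoid altogether

# Drop the excluded clusters, then cap how many cities each cluster contributes
parts = []
for label, group in df[~df['cluster'].isin(excluded)].groupby('cluster'):
    parts.append(group.head(limits.get(label, 0)))
portfolio = pd.concat(parts)

print(portfolio['city'].tolist())  # ['SFO', 'YYZ', 'EWR']
```

Criterion (c), the weather-correlation constraint, is not covered by this sketch; it would need its own optimization step on top of this selection.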
My problem is such that the K for a K-means algorithm would likely be quite large, but I'm not sure what that value should be.
I've been reading the following on clustering:
Cluster analysis in R: determine the optimal number of clusters
How do I determine k when using k-means clustering?
However, a lot of the literature is foreign to me and will take me months to understand. I'm not a data scientist and don't have the time to take a course on machine learning.
At this point I have the dataframe and am now twiddling my thumbs.
I'd be grateful for help in actually implementing Steps #1 and #2 in pandas with an example dataset.
The records below can be turned back into a dataframe with pd.DataFrame(x), where x is the list of dicts below.
Output of df.head().to_dict('rec'):
[{'X1': 123.40000000000001,
'X2': -67.900000000000006,
'X3': 172.0,
'X4': -2507.1999999999998,
'X5': 80.0,
'X6': 1692.0999999999999,
'X7': 13.5,
'X8': 136.30000000000001,
'X9': -187.09999999999999,
'X10': 50.0,
'X11': -822.0,
'X12': 13.0,
'X13': 260.80000000000001,
'X14': 14084.0,
'X15': -944.89999999999998,
'X16': 224.59999999999999,
'X17': -23.100000000000001,
'X18': -16.199999999999999,
'X19': 1825.9000000000001,
'X20': 710.70000000000005,
'X21': -16.199999999999999,
'X22': 1825.9000000000001,
'X23': 66.0,
'city': 'SFO'},
{'X1': -359.69999999999999,
'X2': -84.299999999999997,
'X3': 86.0,
'X4': -1894.4000000000001,
'X5': 166.0,
'X6': 882.39999999999998,
'X7': -19.0,
'X8': -133.30000000000001,
'X9': -84.799999999999997,
'X10': 27.0,
'X11': -587.29999999999995,
'X12': 36.0,
'X13': 332.89999999999998,
'X14': 825.20000000000005,
'X15': -3182.5,
'X16': -210.80000000000001,
'X17': 87.400000000000006,
'X18': -443.69999999999999,
'X19': -3182.5,
'X20': 51.899999999999999,
'X21': -443.69999999999999,
'X22': -722.89999999999998,
'X23': -3182.5,
'city': 'YYZ'},
{'X1': -24.800000000000001,
'X2': -34.299999999999997,
'X3': 166.0,
'X4': -2352.6999999999998,
'X5': 87.0,
'X6': 1941.3,
'X7': 56.600000000000001,
'X8': 120.2,
'X9': -65.400000000000006,
'X10': 44.0,
'X11': -610.89999999999998,
'X12': 19.0,
'X13': 414.80000000000001,
'X14': 4891.1999999999998,
'X15': -2396.0999999999999,
'X16': 181.59999999999999,
'X17': 177.0,
'X18': -92.900000000000006,
'X19': -2396.0999999999999,
'X20': 805.60000000000002,
'X21': -92.900000000000006,
'X22': -379.69999999999999,
'X23': -2396.0999999999999,
'city': 'DFW'},
{'X1': -21.300000000000001,
'X2': -47.399999999999999,
'X3': 166.0,
'X4': -2405.5999999999999,
'X5': 85.0,
'X6': 1836.8,
'X7': 55.700000000000003,
'X8': 130.80000000000001,
'X9': -131.09999999999999,
'X10': 47.0,
'X11': -690.60000000000002,
'X12': 16.0,
'X13': 297.30000000000001,
'X14': 5163.3999999999996,
'X15': -2446.4000000000001,
'X16': 182.30000000000001,
'X17': 83.599999999999994,
'X18': -36.0,
'X19': -2446.4000000000001,
'X20': 771.29999999999995,
'X21': -36.0,
'X22': -378.30000000000001,
'X23': -2446.4000000000001,
'city': 'PDX'},
{'X1': -22.399999999999999,
'X2': -9.0,
'X3': 167.0,
'X4': -2405.5999999999999,
'X5': 86.0,
'X6': 2297.9000000000001,
'X7': 41.0,
'X8': 109.7,
'X9': 64.900000000000006,
'X10': 42.0,
'X11': -558.29999999999995,
'X12': 21.0,
'X13': 753.10000000000002,
'X14': 5979.6999999999998,
'X15': -2370.1999999999998,
'X16': 187.40000000000001,
'X17': 373.10000000000002,
'X18': -224.30000000000001,
'X19': -2370.1999999999998,
'X20': 759.5,
'X21': -224.30000000000001,
'X22': -384.39999999999998,
'X23': -2370.1999999999998,
'city': 'EWR'}]
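For reference, reconstructing the dataframe from those records looks like this (only the first few columns of two records are reproduced here to keep the example short):

```python
import pandas as pd

# Shortened subset of the records printed above
records = [
    {'X1': 123.4, 'X2': -67.9, 'X3': 172.0, 'city': 'SFO'},
    {'X1': -359.7, 'X2': -84.3, 'X3': 86.0, 'city': 'YYZ'},
]

# pd.DataFrame accepts a list of dicts directly; city becomes the index
df = pd.DataFrame(records).set_index('city')
print(df.shape)  # (2, 3)
```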
Upvotes: 1
Views: 5370
Reputation: 1156
I don't know what you mean by "for further processing" but here is a super simple explanation to get you started.
1) Get the data into a pandas dataframe with the variables (X1-X23) across the top as column headers and each row representing a different city (so that your df.head() shows X1-X23 as column headers).
2) Standardize the variables.
3) Decide whether to apply PCA before K-means.
4) Run K-means; scikit-learn makes this part easy.
5) Try silhouette analysis for choosing the number of clusters, to get a start.
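A minimal sketch of steps 2, 4 and 5 with scikit-learn, using random synthetic data in place of the real X1-X23 matrix (the city count and the candidate k values below are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the X1-X23 matrix: 500 cities, 23 statistics
X = rng.normal(size=(500, 23))

# Step 2: standardize so no single statistic dominates the distances
X_std = StandardScaler().fit_transform(X)

# (Step 3, PCA, is skipped here; sklearn.decomposition.PCA would slot in
# between scaling and clustering if you decide to use it.)

# Steps 4-5: fit K-means for several candidate k and compare silhouette
# scores; higher silhouette suggests a better-separated clustering
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    print(k, silhouette_score(X_std, labels))
```

On random data the scores will all be low; on real data, a peak across candidate k values is the usual signal for choosing the number of clusters.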
Good references:
Hastie and Tibshirani's book
Hastie and Tibshirani's free course (note it uses R)
Udacity, Coursera, and edX courses on machine learning
EDIT: I forgot to mention: don't use your whole dataset while you are testing out the process. Use a much smaller portion of the data (e.g. 100K cities) so that processing time stays short until you get everything right.
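That subsampling is one line with pandas' `sample` method (the dataframe below is a made-up stand-in for the real ~6M-row one):

```python
import pandas as pd

# Hypothetical stand-in for the full dataframe (1M rows instead of ~6M)
df = pd.DataFrame({'X1': range(1_000_000)})

# Work on a random 100K-city sample until the pipeline works end to end
sample = df.sample(n=100_000, random_state=0)
print(len(sample))  # 100000
```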
Upvotes: 2