Reputation: 359
I'm using sklearn's agglomerative clustering function. I have mixed data that includes both numeric and nominal columns. My nominal columns have values such as "Morning", "Afternoon", "Evening", "Night". If I convert my nominal data to numeric by assigning integer values like 0, 1, 2, 3, the Euclidean distance between "Night" and "Morning" will be calculated as 3, but the distance should be 1.
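import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("mydata.csv", sep=",", header=0, encoding="utf-8")
X = StandardScaler().fit_transform(X)
print("n_samples: %d, n_features: %d" % X.shape)
# Note: newer scikit-learn versions name the `affinity` parameter `metric`.
km = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average')
km.fit(X)
print("k = %d, Silhouette Coefficient: %0.3f" % (km.n_clusters,
      metrics.silhouette_score(X, km.labels_, sample_size=None)))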
Here is my code.
How can I customize the distance function in sklearn or convert my nominal data to numeric?
Upvotes: 8
Views: 21321
Reputation: 423
I think you have 3 options for converting categorical features to numerical ones:
Code:
import numpy as np

def two_hot(x):
    # Each column marks membership in one pair of adjacent times of day.
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)

x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)
Output:
[['morning']
 ['afternoon']
 ['evening']
 ['night']]
[[1 0 0 1]
 [1 1 0 0]
 [0 1 1 0]
 [0 0 1 1]]
Then we can measure the distances:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)
Output:
array([[0.        , 1.41421356, 2.        , 1.41421356],
       [1.41421356, 0.        , 1.41421356, 2.        ],
       [2.        , 1.41421356, 0.        , 1.41421356],
       [1.41421356, 2.        , 1.41421356, 0.        ]])
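With the two-hot encoding, neighbouring times of day end up about 1.41 apart and opposite ones 2 apart. If you want the distances to be exactly 1 and 2, as the question asks, one alternative is to build the distance matrix yourself. This is only a rough sketch: `CYCLE` and `cyclic_distance_matrix` are names I made up for illustration, and it assumes the categories form a closed cycle in the order listed.

```python
import numpy as np

# Hypothetical ordering of the categories around the daily cycle.
CYCLE = ["morning", "afternoon", "evening", "night"]

def cyclic_distance_matrix(labels, cycle=CYCLE):
    """Pairwise number of steps between labels, going around the cycle."""
    index = {name: i for i, name in enumerate(cycle)}
    pos = np.array([index[label] for label in labels])
    # Absolute difference in position for every pair of samples.
    diff = np.abs(pos[:, None] - pos[None, :])
    # Take the shorter way around the cycle.
    return np.minimum(diff, len(cycle) - diff)

D = cyclic_distance_matrix(["morning", "afternoon", "evening", "night"])
print(D)
```

A matrix like this should be usable with AgglomerativeClustering as a precomputed distance (the `affinity` parameter in the question's code, called `metric` in newer scikit-learn versions, set to 'precomputed', with a linkage other than 'ward'). Combining it with the numeric columns would need a combined distance, which this sketch does not cover.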
Upvotes: 7
Reputation: 553
This problem is common in machine learning applications. You need to define one category as the base category (it doesn't matter which), then define indicator variables (0 or 1) for each of the other categories. In other words, create 3 new variables called "Morning", "Afternoon", and "Evening", and assign a 1 to whichever one matches each observation's category. If it's a night observation, leave all of these new variables as 0.
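A minimal sketch of the indicator variables described above, using plain NumPy. The sample column `times` is made up for illustration, and "Night" is taken as the base category.

```python
import numpy as np

# Hypothetical sample column; "Night" is the base category.
times = np.array(["Morning", "Night", "Evening", "Afternoon"])
indicator_names = ["Morning", "Afternoon", "Evening"]

# One 0/1 column per non-base category; rows for "Night" stay all zeros.
indicators = (times[:, None] == np.array(indicator_names)[None, :]).astype(int)
print(indicators)
```

Each row has at most one 1, and the columns follow the order of `indicator_names`.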
Upvotes: 2