Kmeans clustering non-numeric column

Question

Here is my data set

>   0   1   2   3   4   5
> 
> 0 2020    14446999.0  300340.0    300287.0    2.0 お笑い
> 1 2020    12725811.0  300451.0    300445.0    2.0 格闘技
> 2 2020    15894610.0  300452.0    300451.0    3.0 ボクシング
> 3 2020    16334176.0  300445.0    0.0 1.0 スポーツ
> 4 2020    12725811.0  300451.0    300445.0    2.0 格闘技

Hello Everyone.

I have a dataset that looks like this, and I would like to cluster column 5, which contains people's interests, into about 4 clusters/groups to see what the main interests are.

The first column is the date, and columns 3 and 4 are IDs. I have searched through a lot of examples on Kaggle, and it seems that all the KMeans clustering examples are based on numeric data. My column 5 contains Japanese words, not English, which frustrates me a lot. How can I do this, or can anyone share a link to an example? Thanks in advance.

Corralien · Accepted Answer

You can use pd.factorize to convert your str columns to numeric:

Input dataframe

>>> df
      1           2         3         4    5      6
0  2020  14446999.0  300340.0  300287.0  2.0    お笑い
1  2020  12725811.0  300451.0  300445.0  2.0    格闘技
2  2020  15894610.0  300452.0  300451.0  3.0  ボクシング
3  2020  16334176.0  300445.0       0.0  1.0   スポーツ
4  2020  12725811.0  300451.0  300445.0  2.0    格闘技

df[6] = pd.factorize(df[6])[0]

Output result

>>> df
      1           2         3         4    5  6
0  2020  14446999.0  300340.0  300287.0  2.0  0
1  2020  12725811.0  300451.0  300445.0  2.0  1
2  2020  15894610.0  300452.0  300451.0  3.0  2
3  2020  16334176.0  300445.0       0.0  1.0  3
4  2020  12725811.0  300451.0  300445.0  2.0  1
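
For reference, here is a minimal sketch of what pd.factorize returns, since the line above keeps only the first element of the result. The `interests` Series is a hypothetical stand-in for the interest column, rebuilt by hand from the sample rows; the variable names are illustrative and not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the interest column (values copied from the sample rows).
interests = pd.Series(["お笑い", "格闘技", "ボクシング", "スポーツ", "格闘技"])

# pd.factorize returns a pair: integer codes and the unique values.
# codes[i] is the position of interests[i] inside uniques.
codes, uniques = pd.factorize(interests)

print(codes.tolist())   # [0, 1, 2, 3, 1]
print(list(uniques))    # ['お笑い', '格闘技', 'ボクシング', 'スポーツ']

# The uniques let you map the codes back to the original labels later,
# for example when inspecting which interests ended up in which cluster.
decoded = uniques.take(codes)
print(decoded.tolist() == interests.tolist())   # True
```

Note that the codes are assigned in order of first appearance, so they are arbitrary labels: the numeric distance between two codes says nothing about how similar the two interests are.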
