Reputation: 25
Here is my data set
> 0 1 2 3 4 5
>
> 0 2020 14446999.0 300340.0 300287.0 2.0 お笑い
> 1 2020 12725811.0 300451.0 300445.0 2.0 格闘技
> 2 2020 15894610.0 300452.0 300451.0 3.0 ボクシング
> 3 2020 16334176.0 300445.0 0.0 1.0 スポーツ
> 4 2020 12725811.0 300451.0 300445.0 2.0 格闘技
Hello Everyone.
I have a dataset that looks like this, and I would like to cluster column 5, which contains people's interests, into about 4 clusters/groups to see what the main interests are.
The first column is the date, and columns 3 and 4 are IDs. I have searched through a lot of examples on Kaggle, but all the K-means clustering examples seem to be based on numeric data, and my column 5 contains Japanese words, not English, which is frustrating. How can I do this, or can anyone share a link to an example? Thanks in advance.
Upvotes: 1
Views: 396
Reputation: 120519
You can use pd.factorize to convert your string columns to numeric codes:
Input dataframe:
>>> df
1 2 3 4 5 6
0 2020 14446999.0 300340.0 300287.0 2.0 お笑い
1 2020 12725811.0 300451.0 300445.0 2.0 格闘技
2 2020 15894610.0 300452.0 300451.0 3.0 ボクシング
3 2020 16334176.0 300445.0 0.0 1.0 スポーツ
4 2020 12725811.0 300451.0 300445.0 2.0 格闘技
df[6] = pd.factorize(df[6])[0]
Output result:
>>> df
1 2 3 4 5 6
0 2020 14446999.0 300340.0 300287.0 2.0 0
1 2020 12725811.0 300451.0 300445.0 2.0 1
2 2020 15894610.0 300452.0 300451.0 3.0 2
3 2020 16334176.0 300445.0 0.0 1.0 3
4 2020 12725811.0 300451.0 300445.0 2.0 1
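For reference, here is a self-contained sketch of the same step that also keeps the second return value of pd.factorize, so the integer codes can be mapped back to the original Japanese labels. The DataFrame is rebuilt by hand from the sample rows above, and the names codes, uniques and label_of are only illustrative, not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question; column labels 1..6 follow the answer above.
df = pd.DataFrame({
    1: [2020, 2020, 2020, 2020, 2020],
    2: [14446999.0, 12725811.0, 15894610.0, 16334176.0, 12725811.0],
    3: [300340.0, 300451.0, 300452.0, 300445.0, 300451.0],
    4: [300287.0, 300445.0, 300451.0, 0.0, 300445.0],
    5: [2.0, 2.0, 3.0, 1.0, 2.0],
    6: ["お笑い", "格闘技", "ボクシング", "スポーツ", "格闘技"],
})

# factorize returns (codes, uniques): codes are integers assigned in order of
# first appearance, uniques holds the distinct original values.
codes, uniques = pd.factorize(df[6])
df[6] = codes

# Hypothetical helper mapping: integer code -> original label.
label_of = dict(enumerate(uniques))

print(df)
print(label_of)
```

The codes are arbitrary labels assigned in order of first appearance, so keeping uniques (or a mapping such as label_of) is what lets you translate any later result back to the original interest names.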
Upvotes: 2