K Means Clustering - Handling Non-Numerical Data

Question

I have Twitter data that I want to cluster. It is text data, and I learned that K-means cannot handle non-numerical data. I want to cluster the data based only on the tweets. The data looks like this.

I found this code that converts the text into numerical data.

import numpy as np

def handle_non_numerical_data(df):
    columns = df.columns.values

    for column in columns:
        # Maps each distinct value in this column to an arbitrary integer ID.
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        # Only convert columns that are not already numeric.
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            # Set iteration order is arbitrary, so the IDs carry no meaning.
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1

            df[column] = list(map(convert_to_int, df[column]))

    return df

# data is the DataFrame holding the tweets.
df  = handle_non_numerical_data(data)
print(df.head())

Output:

   label  tweet
0      9     24
1      5     11
2     17     45
3     14    138
4     18    112

I'm quite new to this, and I don't think this is what I need to fit the data. What is a better way to handle non-numerical data (text) of this nature?

Edit: When I run the K-means clustering algorithm on the raw text data, I get this error:

ValueError: could not convert string to float

Arturo Sbr · Accepted Answer

The most typical way of handling non-numerical data is to convert a single column into multiple binary columns. This is called "getting dummy variables" or a "one hot encoding" (among many other snobby terms).

There are other things you can do to translate the data to numbers, such as sentiment analysis (e.g. categorize each tweet as happy, sad, funny, angry, etc.), analyzing the tweets to determine whether they are about a certain subject (e.g. does this tweet talk about a virus?), the number of words in each tweet, the number of spaces per tweet, whether it has good grammar, etc. As you can see, you are asking about a very broad subject.
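As a rough sketch of the simpler ideas above (word count, space count, and a subject flag), assuming the text lives in a column named tweet as in the question. The sample tweets and the keyword "virus" are made up for illustration:

```python
import pandas as pd

# Hypothetical sample; the real data would come from the asker's DataFrame.
tweets = pd.DataFrame({
    "tweet": [
        "Stay home and stay safe",
        "Great meeting today!",
        "The virus is spreading fast in the city",
    ]
})

# Number of whitespace-separated words in each tweet.
tweets["n_words"] = tweets["tweet"].str.split().str.len()
# Number of space characters in each tweet.
tweets["n_spaces"] = tweets["tweet"].str.count(" ")
# 1 if the tweet mentions the (assumed) keyword "virus", else 0.
tweets["mentions_virus"] = (
    tweets["tweet"].str.contains("virus", case=False).astype(int)
)

# Purely numeric table derived from the text.
features = tweets[["n_words", "n_spaces", "mentions_virus"]]
print(features)
```

A numeric table like features can be passed to K-means, although whether such features produce meaningful clusters depends on the data.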

When transforming data to binary columns, you get the number of unique values in your column and make that many new columns, one per unique value. Each new column is filled with zeros and ones, with a one marking the rows that hold that value.

Let's focus on your first column:

import pandas as pd
df = pd.DataFrame({'account':['realdonaldtrump','narendramodi','pontifex','pmoindia','potus']})

           account
0  realdonaldtrump
1     narendramodi
2         pontifex
3         pmoindia
4            potus

One-hot encoded, this becomes:

# dtype=int yields 0/1; recent pandas versions default to True/False.
pd.get_dummies(df, columns=['account'], prefix='account', dtype=int)

   account_narendramodi  account_pmoindia  account_pontifex  account_potus  \
0                     0                 0                 0              0   
1                     1                 0                 0              0   
2                     0                 0                 1              0   
3                     0                 1                 0              0   
4                     0                 0                 0              1   

   account_realdonaldtrump  
0                        1  
1                        0  
2                        0  
3                        0  
4                        0

This is one of many methods. You can check out this article about one hot encoding here.

NOTE: When a column has many unique values, doing this will give you many columns, and some algorithms may fail or perform poorly because there are not enough degrees of freedom (too many variables, not enough observations). Also, if you are running a regression, you will run into perfect multicollinearity unless you drop one of the dummy columns.
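To illustrate the multicollinearity point, here is a small sketch using the drop_first parameter of pd.get_dummies, which omits the column for the first category. The account names are the same illustrative ones as above, and dtype=int is only there to get 0/1 values:

```python
import pandas as pd

# Same illustrative accounts as in the example above.
df = pd.DataFrame({
    "account": ["realdonaldtrump", "narendramodi", "pontifex", "pmoindia", "potus"]
})

# Full encoding: one indicator column per unique value.
full = pd.get_dummies(df, columns=["account"], dtype=int)
# drop_first=True omits the first category's column.
reduced = pd.get_dummies(df, columns=["account"], drop_first=True, dtype=int)

print(full.shape)     # five indicator columns
print(reduced.shape)  # four indicator columns

# In the full encoding every row sums to 1, which is what makes the
# columns perfectly collinear with a regression intercept.
print(full.sum(axis=1).tolist())
```

The dropped category becomes the baseline: a row with zeros in every remaining column belongs to it. For clustering this is usually not needed; it matters mainly for regression.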

Going back to your example, if you want to turn all your columns into this kind of data, try:

pd.get_dummies(df)

However, I wouldn't do this for the tweet column because each tweet is its own unique value.
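A small sketch of the difference, assuming a DataFrame with the question's label and tweet columns. The label values and tweets here are invented, since the original data is not shown:

```python
import pandas as pd

# Hypothetical sample with the question's two column names.
data = pd.DataFrame({
    "label": ["politics", "sports", "politics"],
    "tweet": ["Vote tomorrow", "What a match!", "New policy announced"],
})

# Encoding everything creates one column per distinct tweet as well.
everything = pd.get_dummies(data, dtype=int)
# Restricting columns= to 'label' leaves the tweet text untouched.
label_only = pd.get_dummies(data, columns=["label"], dtype=int)

print(everything.shape)
print(list(label_only.columns))
```

With real data, the first call would add roughly one column per tweet, which carries no information that helps group tweets together. The second call keeps the encoding limited to the column where categories actually repeat.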
