Reputation:
I have some doubts about encoding categorical variables (I am not familiar with tasks like these) so that I can use them as features in a model like logistic regression or SVM. My dataset looks like the following:
Text                                 Symbol  Note    Account  Age  Label
There is a red car                   !       red     John     24   1
My bag was very expensive            ?       orange  Luke     36   0
Where are my keys?                   @       red     Red      58   1
I promise: I will never let you go!  ...     green   Aoife    28   0
'Text' stores comments from users in a community. 'Symbol' is the symbol a user uses most. 'Note' represents the user's level (green is more experienced; red is a new joiner). 'Account' is the user name. 'Label' gives information about the user's trustworthiness (0 means the user is not fake; 1 means the user might be a bot).
I would like to classify new users based on the current information (see columns above). My dataset includes more than 1000 rows and 400 users.
Since to use classifiers I need to encode the categorical and text fields, I have tried the following, using a MultiColumnLabelEncoder (built on sklearn's LabelEncoder):
MultiColumnLabelEncoder(columns = ['Text', 'Symbol', 'Note', 'Account']).fit_transform(df)
where df is my dataframe. However, I understand that OneHotEncoder would be preferable. I also included 'Account' because there might be more comments from the same account, so if I classify an account as fake and then receive a new comment from that account, the account could easily be detected as fake.
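To be concrete, for the categorical columns I imagine the OneHotEncoder usage would be something like the sketch below (this does not cover the 'Text' column, which is my main doubt):

from sklearn.preprocessing import OneHotEncoder

# df is my dataframe shown above
ohe = OneHotEncoder(handle_unknown='ignore')
X_cat = ohe.fit_transform(df[['Symbol', 'Note', 'Account']])  # sparse one-hot matrix
print(ohe.get_feature_names_out())  # e.g. ['Symbol_!', 'Symbol_?', ..., 'Note_green', ..., 'Account_Aoife', ...]
# use get_feature_names() instead on older scikit-learn versions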
The aim, as I mentioned, is to classify new elements from a test set with a certain accuracy, based on the information given (symbol, note, age, texts), i.e. looking for correlations among these variables that would allow me to say whether a new account is fake (1) or not (0).
The problem, as you can see, is that the features are not only numerical but also categorical and textual.
For data preprocessing (removing stopwords and cleaning the data), I have used Python packages from NLTK. Regarding feature extraction (this should be a key point, as it is linked to the next step, i.e. using a classifier to predict the class, 1 or 0), I have found it difficult to understand what output I should expect from the encoding in order to use the information above as inputs to my model (where the target is called 'Label' and is a binary value). As classifier I am using logistic regression, but also SVM.
My expected output for a user X (age 16, symbol #, text "Wonderful", and note red, i.e. a new joiner) would be a classification as fake with a certain percentage.
I would appreciate it if someone could explain to me, step by step, how to transform my dataset into one whose variables I can use in a logistic regression to determine the label (fake or not fake) of new users.
Upvotes: 1
Views: 1210
Reputation: 3559
I did this based on some old code of mine that is itself based on the scikit-learn tutorial on working with text data. Let me also point you to section 6.2.3 of the scikit-learn user guide (Text feature extraction) and note that CountVectorizer will be of particular interest, as it covers what you want to do with OneHotEncoder and more. From the CountVectorizer documentation:
CountVectorizer implements both tokenization and occurrence counting in a single class:
In the example you provided, you have 22 word tokens in total, of which roughly 20 are unique -- assuming you use all the words, which probably isn't what you want. Put differently, words like "there", "is", "a", "my", "was", "I" and "where" probably can't help you tell a good account from a bogus one, but words like "Nigeria", "prince", "transfer", "bank", "penis" or "enlargement" are likely indicative of spam.
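To make that concrete, here is a quick sketch of the vocabulary you would get from the four example texts (exact counts depend on the tokenizer; the default one drops single-character tokens like "a" and "I"):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["There is a red car",
         "My bag was very expensive",
         "Where are my keys?",
         "I promise: I will never let you go!"]

vect = CountVectorizer()
X = vect.fit_transform(texts)
print(X.shape)                   # (4, 18): 'a' and 'I' are dropped by the default token pattern
print(sorted(vect.vocabulary_))  # the unique words that become your columns

# with English stop words removed, only the content words survive
X_sw = CountVectorizer(stop_words='english').fit_transform(texts)
print(X_sw.shape)                # far fewer columns, e.g. 'bag', 'car', 'expensive', 'keys', 'red', ...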
So, you'd have roughly 20 dimensions (minus whatever you exclude) of data before you even get to the other columns like age, symbol, etc. That's a lot of mostly empty data (all those 0's you don't need), so people either store it as a sparse matrix and/or use some sort of shrinkage such as Lasso or Ridge. You might think that's exactly what you want to do right now and that you're on the right track, and you kind of are, except there are a couple more points to deal with (and they take this a bit beyond what you actually asked).
First, and I think this is the important one, some of your fields should be suspect because they are user reported (like age) or are useless/redundant (like the name). No kid goes on a porn or distillery site and says they are 15. No pervy old guy says he's 65 looking to chat with underage kids. Even on dating sites, where you'd think people would eventually be found out, people lie about their ages. The same goes for names. You can include them if you want, but remember the old adage: Garbage In, Garbage Out.
Second, Lasso and Ridge regressions both add penalty terms to the cost function to help with overfitting. So, house price based on square footage and zip code makes sense. But when you get down to the date of the last property tax assessment or the distance to the nearest library, you might be thinking "Really?" That isn't really the situation you have, though.
Putting those two together: in your case you have Text (definitely useful), Symbol (a derivative of the text), Account and Age (see the note above), Note (probably useful as a proxy for how long they've been around and active), and Label -- your assessment. So, of the five fields, only two are likely to be useful in predicting the assessment. All this is to say that while you can use Lasso or Ridge, you might be better served using naive Bayes for this task. If you're up for it, there are multiple pages that show they are equivalent under certain conditions [example]. But the reason to consider Bayes here is the computational load for this example.
Symbols. I've been loath to say this but, from experience, punctuation is not a good indicator. The reason I say loath is that you might come up with some novel implementation, but a lot of people have tried, so the odds are small. Part of this is related to Zipf's Law, which deals with words rather than punctuation. However, if you make punctuation carry some sort of additional semantic meaning, it is essentially another word. Remember, the goal is not to find that a symbol appears in spam; rather, the goal is to find whether the symbol is a reliable and sufficiently unique indicator of spam.
But if you really want to add punctuation as some sort of indicator, you might need to think of it differently. For example, is just the presence of a question mark enough? Or having three or more in a row? Or a high percentage of punctuation characters per {text, email, message, post, etc.}? This gets into feature engineering, which is part of why I say you need to think it through. Personally (and from a quick look through my spam folder) I'd look at emoji, foreign characters (e.g., £) and perhaps text effects (bold, underlined, etc.). But then you have a separate, second question. With the text content you have probabilistic loadings with, say, an aggregate measurement:
print(f"{message} is flagged for consideration at {loading}%.")
But for the options suggested above, you would need to develop some sort of weighting for that feature. You could just append the symbol to each Text field before TF-IDF, but then you need a different approach. You could also assign one weighting to the content and a second one to your engineered feature, based on Principal Component Analysis and/or a confusion matrix.
For example - Text 34 is known spam:
N£w Skinny Pill Kills Too Much Fat? This Diet is Sweeping The Nation
The Bayesian approach assigns an aggregate probability of 94% spam, well above your threshold of, say, 89%. But it is known spam, so the true probability is 100%. What is the 6% delta most likely due to? I'd argue in this case it's the £.
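If you do want to capture signals like that explicitly, here is a rough sketch of the kind of hand-rolled features I mean (the feature names and exact signals are only illustrations, not a recommendation):

import re

def punctuation_features(text):
    """Illustrative hand-engineered signals of the kind discussed above."""
    n_chars = max(len(text), 1)
    return {
        'has_question':    int('?' in text),
        'repeated_marks':  int(bool(re.search(r'[!?]{3,}', text))),      # three or more in a row
        'punct_ratio':     sum((not c.isalnum()) and (not c.isspace()) for c in text) / n_chars,
        'non_ascii_ratio': sum(ord(c) > 127 for c in text) / n_chars,    # emoji, £, and friends
    }

print(punctuation_features("N£w Skinny Pill Kills Too Much Fat? This Diet is Sweeping The Nation"))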
The same applies to the Note field: from your training set, you may find zero accounts more than two years old that send spam, while 90% of spam comes from accounts less than a week old.
Anyway, on to the code and implementation.
This is supervised so 'Label' is critical by definition.
You didn't mention this, but it's worth noting: split your data into training and test sets with sklearn.model_selection.train_test_split.
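Something along these lines, assuming the DataFrame is called df as in the question (the variable names here are the ones the later snippets reuse):

from sklearn.model_selection import train_test_split

# hold out 20% for testing; stratify so both splits keep the same 0/1 balance
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['Text'], df['Label'], test_size=0.2, random_state=42, stratify=df['Label'])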
This is where what you're specifically asking about starts. Turn the corpus (the collection of documents) into a bag of words. You said you were using NLTK, which is good for academia but I find overly cumbersome. spaCy is great, Gensim rocks, but I'm using scikit-learn here. My code varies a bit from the tutorial example in that it shows a bit of what is going on behind the scenes.
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True, tokenizer=None, stop_words='english',
                             analyzer='word', max_df=1.0, min_df=1, max_features=None)
count_vect.fit(X_train_text)  # X_train_text: the raw 'Text' column of your training split

# uncomment if you'd like to know the mapping of the columns to the words.
# count_vect.vocabulary_
# for key in sorted(count_vect.vocabulary_.keys()):
#     print("{0:<20s} {1}".format(key, count_vect.vocabulary_[key]))
About the training set:
X_train_counts = count_vect.transform(X_train_text)
print("The type of X_train_counts is {0}.".format(type(X_train_counts)))
print("The X matrix has {0} rows (documents) and {1} columns (words).".format(
    X_train_counts.shape[0], X_train_counts.shape[1]))
That will give you something like this:
The type of X_train_counts is <class 'scipy.sparse.csr.csr_matrix'>.
The X matrix has 2257 rows (documents) and 35482 columns (words).
Now you have occurrences of words: CountVectorizer simply counts the number of times each word appears in each document. For each document, we'd like to normalize by the number of words it contains; that is the term (or word) frequency, TF. IDF (inverse document frequency) then down-weights words that show up in many documents, so the common but uninformative words don't swamp the rare, telling ones. With a corpus your size this matters less than it does on a gargantuan data set, but it is the standard next step.
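In code, that step is just the TfidfTransformer from the same module, applied to the counts above:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()                           # TF normalization + IDF re-weighting
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)                                       # same shape as the counts, new weights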
Stick with the scikit-learn tutorial on this, at least for now. It uses naive Bayes, and I laid out my reasoning above for why I think Lasso and Ridge aren't best suited here. But if you want to go with a regression model, you're set up for that too. If you want to add in your other fields (symbol, age, etc.), you might consider just appending them to each record.
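For example, sticking with naive Bayes (y_train comes from the split above; nb_model is the name the later snippets assume):

from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# quick sanity check on the held-out split
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test_text))
print("Held-out accuracy:", nb_model.score(X_test_tfidf, y_test))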
At this point I have another couple of steps:
In general, picking the categories and words associated with each is somewhat of an art. You will probably have to iterate on this.
import numpy as np

feature_words = count_vect.get_feature_names_out()  # use get_feature_names() on older scikit-learn
n = 7  # number of top words associated with each category that we wish to see
for cat in range(len(nb_model.classes_)):
    print(f"\nTarget: {cat}, name: {target_names[cat]}")  # target_names: your own labels, e.g. ['not fake', 'fake']
    log_prob = nb_model.feature_log_prob_[cat]
    i_topn = np.argsort(log_prob)[::-1][:n]
    features_topn = [feature_words[i] for i in i_topn]
    print(f"Top {n} tokens: ", features_topn)
Make up a new doc or three along the lines of your existing classes. Then:
docs_new = ["Free gift cards, click here!",  # invented examples -- use your own
            "Thanks, see you at the meeting tomorrow"]

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predictions = nb_model.predict(X_new_tfidf)

print('Predictions')
for doc, category in zip(docs_new, predictions):
    print("{0} => {1}".format(doc, target_names[category]))  # twenty_train.target_names in the original tutorial
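If you want the "fake with a certain percentage" output you described, naive Bayes also exposes predict_proba, so you can apply whatever threshold you like (the 0.89 below is only an illustration):

probs = nb_model.predict_proba(X_new_tfidf)[:, 1]  # column for class 1; check nb_model.classes_ for the order
threshold = 0.89                                   # arbitrary; tune it on your validation split
for doc, p in zip(docs_new, probs):
    if p >= threshold:
        print(f"{doc} is flagged for consideration at {100 * p:.0f}%.")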
Alongside the class predictions, you also have word: tf-idf loadings, with several terms/loadings per document. You may or may not have set a threshold, so that you see only the filtered results in addition to the model's probabilistic output. Last point: there is a reason why this is a field of its own, why books are written on it, and why there are multiple article series about it. So cramming it into one wall of text, while concise, is probably sub-optimal, in that there is so very much not included here that you still need to know.
Upvotes: 1