DimiDev

Reputation: 93

Bag of Words encoding for Python with vocabulary

I am trying to add new feature columns to my ML model. A numeric column should be created for each specific word that is found in the text of the scraped data. To test this I wrote a dummy script.

import pandas as pd

bagOfWords = ["cool", "place"]
wordsFound = ""

mystring = "This is a cool new place"
mystring = mystring.lower()

for word in bagOfWords:
    if word in mystring: 
        wordsFound = wordsFound + word + " "

print(wordsFound)
pd.get_dummies(wordsFound)

The output is

    cool place
0   1

This means that for the single sentence (row "0") there is one column named "cool place". This is not correct. I expected one column per word, like this:

    cool place
0   1    1

Upvotes: 0

Views: 353

Answers (1)

DimiDev

Reputation: 93

I found a different solution, as I could not find any way forward. It is a simple, direct one-hot encoding: for every word I need, I add a new column to the dataframe and fill in the encoding directly.

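vocabulary = ["achtung", "suchen"]

for word in vocabulary:
    # Start with 0 everywhere, then mark the rows whose title contains the word.
    df2[word] = 0

    for index, row in df2.iterrows():
        if word in row["title"].lower():
            # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell.
            df2.at[index, word] = 1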
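
The row-by-row loop above could probably also be written with pandas string methods, which avoids iterrows. This is only a minimal sketch: the sample titles are made up for illustration, and only the names df2, "title" and vocabulary come from the code above.

```python
import pandas as pd

# Hypothetical sample data; the real df2 holds the scraped titles.
df2 = pd.DataFrame({"title": ["Achtung Baustelle", "Wir suchen Hilfe", "Nichts davon"]})
vocabulary = ["achtung", "suchen"]

# Lowercase once, then reuse for every word.
titles = df2["title"].str.lower()

for word in vocabulary:
    # Plain substring test, like `word in title`; regex=False treats the word literally.
    df2[word] = titles.str.contains(word, regex=False).astype(int)

print(df2)
```

Like the loop version, this is a substring match, so "suchen" would also hit a title containing "besuchen". If whole words are needed, the titles would have to be split into tokens first.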

Upvotes: 0
