Mitchell.Laferla

Reputation: 253

How can I vectorize a series of tokens

I have a series of tokens that I am attempting to vectorize. However, I keep getting the error message "TypeError: expected string or bytes-like object".

My text tokens:

tokens_raw
0 [kitchen, getting, children, ready, school, ru...
1 [shanghai, appointed, manager, taco, bell, chi...
2 [april, uber, announced, acquisition, otto, sa... etc.....

My code:

from sklearn.feature_extraction.text import TfidfVectorizer

# list of tokens from text documents
tokens_raw = articles_df_60k['processed_content']

# create the transform
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, lowercase=False,
                             encoding='latin-1', ngram_range=(1, 2))

# tokenize and build vocab
vectorizer.fit(tokens_raw)
 
# summarize
print("vocabulary count:", vectorizer.vocabulary_, sep='\n')
print('\n')
print("inverse document frequency:", vectorizer.idf_, sep='\n')
print('\n')

# encode document
vector = vectorizer.transform(tokens_raw)

# summarize encoded vector
print("vector shape:", vector_raw.shape, sep='\n')
print('\n')
print("vector array:", vector_raw.toarray(), sep='\n')

Again, the error message is "TypeError: expected string or bytes-like object". It works if I just input

tokens_raw[0]

however, trying to apply it to all of the rows raises the error again. Any guidance, explanations, or solutions would be much appreciated.

Upvotes: 2

Views: 2683

Answers (2)

mbatchkarov

Reputation: 16039

As @Praveen said, sklearn's vectorizers expect a list of strings by default. To understand the root cause of your problem, it helps to know what happens to each string. Sklearn's text module (see feature_extraction/text.py, especially build_analyzer and _analyze) takes each string and performs roughly the following analysis, in order:

  • decode any "funny" characters
  • preprocess the text, e.g. by replacing accented characters with their ascii equivalent
  • tokenise the text. At this point each document (string) will be a list of tokens (list of strings)
  • extract n-grams

The important point is that each of these steps is configurable and can be overridden. Your data already looks tidy, so you don't need any decoding or character normalisation. Your text is already tokenised, so you can tell sklearn not to do anything:

from sklearn.feature_extraction.text import TfidfVectorizer

tokens_raw = [
    ["kitchen", "getting", "children", "ready", "school"],
    ["shanghai", "appointed", "manager", "taco", "bell"],
]

vectorizer = TfidfVectorizer(analyzer=lambda x: x)
X = vectorizer.fit_transform(tokens_raw)
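
The same pattern should carry over to the pandas column from the question. A minimal sketch, assuming articles_df_60k['processed_content'] holds lists of token strings and that you still want the sublinear_tf and min_df settings from your original code (ngram_range and lowercase are omitted here because sklearn ignores them when analyzer is a callable):

# sketch: bypass sklearn's own preprocessing/tokenisation and
# feed the already-tokenised documents straight to tf-idf weighting
vectorizer = TfidfVectorizer(analyzer=lambda x: x, sublinear_tf=True, min_df=5)
X = vectorizer.fit_transform(articles_df_60k['processed_content'])
print("vector shape:", X.shape)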

Upvotes: 2

Praveen Sujanmulk

Reputation: 41

TfidfVectorizer expects each input document to be a string, but you are passing it a list of words. Join each token list back into a single string first:

from sklearn.feature_extraction.text import TfidfVectorizer

tokens_raw = [['kitchen', 'getting', 'children', 'ready', 'school'],
              ['shanghai', 'appointed', 'manager', 'taco', 'bell']]
tokens_raw = [" ".join(t) for t in tokens_raw]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tokens_raw)
print("Shape: ", X.shape)
print(X.todense())

Output:

Shape:  (2, 10)
[[0.        0.        0.4472136 0.4472136 0.4472136 0.        0.4472136
  0.4472136 0.        0.       ]
 [0.4472136 0.4472136 0.        0.        0.        0.4472136 0.
  0.        0.4472136 0.4472136]]
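
The same joining step can be applied to the column from the question before fitting. A minimal sketch, assuming articles_df_60k['processed_content'] contains lists of token strings:

# join each token list back into one space-separated string per document,
# then fit the vectorizer settings from the question on those strings
docs = articles_df_60k['processed_content'].apply(" ".join)

vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, lowercase=False,
                             ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print("Shape: ", X.shape)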

Hope this works.

Upvotes: 2
