Mitchell.Laferla

Reputation: 253

How can I vectorize a series of tokens

I have a series of tokens that I am attempting to vectorize. However, I keep getting the error message "TypeError: expected string or bytes-like object".

My text tokens:

tokens_raw
0 [kitchen, getting, children, ready, school, ru...
1 [shanghai, appointed, manager, taco, bell, chi...
2 [april, uber, announced, acquisition, otto, sa... etc.....

My code:

from sklearn.feature_extraction.text import TfidfVectorizer

# list of tokens from text documents
tokens_raw = articles_df_60k['processed_content']

# create the transform
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, lowercase=False,
                             encoding='latin-1', ngram_range=(1, 2))

# tokenize and build vocab
vectorizer.fit(tokens_raw)
 
# summarize
print("vocabulary count:", vectorizer.vocabulary_, sep='\n')
print('\n')
print("inverse document frequency:", vectorizer.idf_, sep='\n')
print('\n')

# encode document
vector = vectorizer.transform(tokens_raw)

# summarize encoded vector
print("vector shape:", vector_raw.shape, sep='\n')
print('\n')
print("vector array:", vector_raw.toarray(), sep='\n')

Again, the error message is "TypeError: expected string or bytes-like object". It works if I just input

tokens_raw[0]

however, trying to apply it to all of the rows raises the error again. Any guidance, explanations, or solutions would be much appreciated.

Upvotes: 2

Views: 2683

Answers (2)

mbatchkarov

Reputation: 16039

As @Praveen said, sklearn's vectorizers expect a list of strings by default. To understand the root cause of your problem, it helps to know what happens to each string. Sklearn's text module (see feature_extraction/text.py, especially build_analyzer and _analyze) takes each string and performs roughly the following analysis, in order:

  • decode any "funny" characters
  • preprocess the text, e.g. by replacing accented characters with their ascii equivalent
  • tokenise the text. At this point each document (string) will be a list of tokens (list of strings)
  • extract n-grams

The important point is that each of these steps is configurable and can be overridden. Your data already looks tidy, so you don't need any decoding or character normalisation. Your text is already tokenised, so you can tell sklearn not to do anything:

from sklearn.feature_extraction.text import TfidfVectorizer

tokens_raw = [
    ["kitchen", "getting", "children", "ready", "school"],
    ["shanghai", "appointed", "manager", "taco", "bell"],
]

vectorizer = TfidfVectorizer(analyzer=lambda x: x)
X = vectorizer.fit_transform(tokens_raw)
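
The same pattern should carry over to the pandas column from the question. A minimal sketch, assuming articles_df_60k['processed_content'] holds lists of token strings and that you still want the sublinear_tf and min_df settings from your original code (ngram_range and lowercase are omitted here because sklearn ignores them when analyzer is a callable):

# sketch: bypass sklearn's own preprocessing/tokenisation and
# feed the already-tokenised documents straight to tf-idf weighting
vectorizer = TfidfVectorizer(analyzer=lambda x: x, sublinear_tf=True, min_df=5)
X = vectorizer.fit_transform(articles_df_60k['processed_content'])
print("vector shape:", X.shape)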

Upvotes: 2

Praveen Sujanmulk

Reputation: 41

TfidfVectorizer expects each input document to be a string, but you are passing it a list of words. Join each token list back into a single string first:

from sklearn.feature_extraction.text import TfidfVectorizer

tokens_raw = [['kitchen', 'getting', 'children', 'ready', 'school'],
              ['shanghai', 'appointed', 'manager', 'taco', 'bell']]
tokens_raw = [" ".join(t) for t in tokens_raw]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tokens_raw)
print("Shape: ", X.shape)
print(X.todense())

Output:

Shape:  (2, 10)
[[0.        0.        0.4472136 0.4472136 0.4472136 0.        0.4472136
  0.4472136 0.        0.       ]
 [0.4472136 0.4472136 0.        0.        0.        0.4472136 0.
  0.        0.4472136 0.4472136]]
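
The same joining step can be applied to the column from the question before fitting. A minimal sketch, assuming articles_df_60k['processed_content'] contains lists of token strings:

# join each token list back into one space-separated string per document,
# then fit the vectorizer settings from the question on those strings
docs = articles_df_60k['processed_content'].apply(" ".join)

vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, lowercase=False,
                             ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print("Shape: ", X.shape)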

Hope this works.

Upvotes: 2
