Reputation: 253
I have a series of tokens that I am attempting to vectorize. However, I keep getting the error message "TypeError: expected string or bytes-like object".
My text tokens:
tokens_raw
0 [kitchen, getting, children, ready, school, ru...
1 [shanghai, appointed, manager, taco, bell, chi...
2 [april, uber, announced, acquisition, otto, sa...
etc.
My code:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of tokens from text documents
tokens_raw = articles_df_60k['processed_content']
# create the transform
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=5, lowercase=False,
encoding='latin-1', ngram_range=(1, 2))
# tokenize and build vocab
vectorizer.fit(tokens_raw)
# summarize
print("vocabulary count:", vectorizer.vocabulary_, sep='\n')
print('\n')
print("inverse document frequency:", vectorizer.idf_, sep='\n')
print('\n')
# encode document
vector = vectorizer.transform(tokens_raw)
# summarize encoded vector
print("vector shape:", vector_raw.shape, sep='\n')
print('\n')
print("vector array:", vector_raw.toarray(), sep='\n')
Again, the error message is "TypeError: expected string or bytes-like object". It works if I just pass in
tokens_raw[0]
but trying to apply it to all the rows brings the error back. Any guidance, explanations, or solutions would be much appreciated.
Upvotes: 2
Views: 2683
Reputation: 16039
As @Praveen said, sklearn's vectorizers expect a list of strings by default. To understand the root cause of your problem you should understand what happens to each string. Sklearn's text module (see feature_extraction/text.py, especially build_analyzer and _analyze) takes each string and performs roughly the following analysis, in order:
1. decoding: raw bytes are decoded to unicode (the encoding and decode_error parameters);
2. preprocessing: character normalisation, i.e. strip_accents and lowercase;
3. tokenisation: the string is split into tokens, by default with the token_pattern regex;
4. stop-word removal and n-gram generation, controlled by stop_words and ngram_range.
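You can watch this pipeline run by calling the built analyzer directly (a quick illustration; the sample string is made up):

from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the decode/preprocess/tokenize/n-gram pipeline as a callable
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyzer("The Kitchen!"))  # ['the', 'kitchen', 'the kitchen']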
The important point is that each of these steps is configurable and can be overridden. Your data already looks tidy, so you don't need any decoding or character normalisation, and your text is already tokenised, so you can tell sklearn not to do anything by passing an identity function as the analyzer:
from sklearn.feature_extraction.text import TfidfVectorizer

# documents are already tokenised: each one is a list of tokens
tokens_raw = [
    ["kitchen", "getting", "children", "ready", "school"],
    ["shanghai", "appointed", "manager", "taco", "bell"],
]

# the identity analyzer bypasses decoding, preprocessing and tokenisation entirely
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
X = vectorizer.fit_transform(tokens_raw)
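With the identity analyzer each list element becomes one vocabulary entry, so the two five-token documents above produce a (2, 10) matrix. A quick check (assuming scikit-learn >= 1.0, which provides get_feature_names_out):

print(X.shape)  # (2, 10)
print(sorted(vectorizer.get_feature_names_out()))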
Upvotes: 2
Reputation: 41
TfidfVectorizer expects the input data to be strings, but you are giving it lists of words. You can join each token list back into a single string before fitting:
tokens_raw = [['kitchen', 'getting', 'children', 'ready', 'school'],
              ['shanghai', 'appointed', 'manager', 'taco', 'bell']]

# join each token list into one whitespace-separated string
tokens_raw = [" ".join(t) for t in tokens_raw]

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tokens_raw)
print("Shape:", X.shape)
print(X.todense())
Output:
Shape: (2, 10)
[[0. 0. 0.4472136 0.4472136 0.4472136 0. 0.4472136
0.4472136 0. 0. ]
[0.4472136 0.4472136 0. 0. 0. 0.4472136 0.
0. 0.4472136 0.4472136]]
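Each nonzero entry is 1/sqrt(5) ≈ 0.4472136: every document contains five distinct words, each word occurs in exactly one of the two documents (so all idf values are equal), and TfidfVectorizer L2-normalises each row by default.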
Hope this works.
Upvotes: 2