Reputation: 63
My data format is:
datas = [[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2],...]
Each element in datas is a sentence, and each number is a word. I want to get the TF-IDF value for each number. How can I do this with sklearn or some other way?
My code:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
datas = [[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2]]
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(datas))
print(tfidf)
My code doesn't work. Error:
Traceback (most recent call last):
  File "C:/Users/zhuowei/Desktop/OpenNE-master/OpenNE-master/src/openne/buildTree.py", line 103, in <module>
    X = vectorizer.fit_transform(datas)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'int' object has no attribute 'lower'
Upvotes: 0
Views: 209
Reputation: 36599
You are using CountVectorizer, which requires an iterable of strings, something like:
datas = ['First sentence',
'Second sentence', ...
...
'Yet another sentence']
But your data is a list of lists of integers, which is why the error occurs. You need to convert each inner list to a string for CountVectorizer to work. You can do this:
datas = [' '.join(map(str, x)) for x in datas]
This makes datas look like this:
['1 2 4 6 7', '2 3', '5 6 8 3 5', '2', '93 23 4 5 11 3 5 2']
This form can now be consumed by CountVectorizer. But even then you will not get proper results, because of the default token_pattern in CountVectorizer:
token_pattern : string, default '(?u)\b\w\w+\b'
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
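To see what that default means for this data, the pattern can be tried directly with Python's re module. This is only a sketch: it assumes that applying the pattern with re.findall is a fair stand-in for how CountVectorizer tokenizes a document, and the variable names are made up for illustration.

```python
import re

# Default CountVectorizer token pattern: tokens of 2 or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: tokens of 1 or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

sentence = "93 23 4 5 11 3 5 2"

# With the default pattern, every single-digit number is dropped.
default_tokens = re.findall(DEFAULT_PATTERN, sentence)
# With the relaxed pattern, every number survives as a token.
relaxed_tokens = re.findall(RELAXED_PATTERN, sentence)

print(default_tokens)  # ['93', '23', '11']
print(relaxed_tokens)  # ['93', '23', '4', '5', '11', '3', '5', '2']
```

A sentence made up only of single-digit numbers, such as '5 6 8 3 5', yields no tokens at all under the default pattern.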
In order for it to treat your single-digit numbers as words, you need to change the pattern so that it accepts single-character tokens:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
Then it should work. Note, however, that your numbers are now treated as strings rather than integers.
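If keeping the numbers as integers matters, one of the "other ways" is to compute TF-IDF by hand on the original lists. The following is a minimal sketch, not sklearn's implementation: the function name tfidf is hypothetical, and the weighting (raw counts, idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation per sentence) is my assumption of what TfidfTransformer does with its default settings, so verify it against the sklearn output before relying on it.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed idf; assumed to mirror sklearn's smooth_idf=True default.
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1
           for tok, count in df.items()}
    result = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency
        weights = {tok: tf * idf[tok] for tok, tf in counts.items()}
        # L2-normalise so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        result.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return result

datas = [[1, 2, 4, 6, 7], [2, 3], [5, 6, 8, 3, 5], [2], [93, 23, 4, 5, 11, 3, 5, 2]]
scores = tfidf(datas)
for sentence, weights in zip(datas, scores):
    print(sentence, weights)
```

Each dict is keyed by the original integers, so a lookup like scores[1][3] gives the weight of word 3 in the second sentence without any string conversion.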
Upvotes: 3