How to transform the data and calculate the TFIDF value?

Question

My data format is: datas = [[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2],...]. Each element in datas is a sentence, and each number is a word. I want to get the TF-IDF value for each number. How can I do this with sklearn, or in some other way?

My code:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

datas = [[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2]]
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(datas))
print(tfidf)

My code doesn't work. Error:

Traceback (most recent call last):
  File "C:/Users/zhuowei/Desktop/OpenNE-master/OpenNE-master/src/openne/buildTree.py", line 103, in <module>
    X = vectorizer.fit_transform(datas)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'int' object has no attribute 'lower'

Vivek Kumar · Accepted Answer

You are using CountVectorizer, which requires an iterable of strings, something like:

datas = ['First sentence', 
         'Second sentence', ...
          ...
         'Yet another sentence']

But your data is a list of lists of integers, which is why the error occurs. You need to convert each inner list to a single string for CountVectorizer to work. You can do this:

datas = [' '.join(map(str, x)) for x in datas]

This will result in datas like this:

['1 2 4 6 7', '2 3', '5 6 8 3 5', '2', '93 23 4 5 11 3 5 2']

Now this form is consumable by CountVectorizer. But even then you will not get proper results, because of the default token_pattern in CountVectorizer:

token_pattern : '(?u)\b\w\w+\b'

string. Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
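To see the effect on this data, here is a small sketch that applies the default pattern directly with Python's re module. It only approximates what the vectorizer's tokenizer does with the pattern, and the variable names are illustrative:

```python
import re

# Default token_pattern quoted above: a token needs at least two word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")

# Every "word" in the first sentence is a single digit, so nothing matches.
short_doc = default_pattern.findall("1 2 4 6 7")
print(short_doc)  # []

# Only the multi-digit numbers survive in the last sentence.
long_doc = default_pattern.findall("93 23 4 5 11 3 5 2")
print(long_doc)  # ['93', '23', '11']
```

So with the default settings, every single-digit word would be silently dropped from the vocabulary.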

For it to treat your numbers as words, you need to change the pattern so that it also accepts single-character tokens:

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

Then it should work. Note that your numbers are now handled as strings, so the vocabulary contains '1', '2', ... rather than integers.
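Putting the pieces together, a minimal end-to-end sketch could look like the following. It uses only the five sentences from the question; the names docs, column_to_word and tfidf_per_sentence are illustrative, and mapping the scores back to integer words is an extra step added here, not something sklearn does for you:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

datas = [[1, 2, 4, 6, 7], [2, 3], [5, 6, 8, 3, 5], [2], [93, 23, 4, 5, 11, 3, 5, 2]]

# Turn every sentence (list of ints) into one space-separated string.
docs = [" ".join(map(str, sentence)) for sentence in datas]

# Relaxed token_pattern so single-character tokens such as "2" are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

# vocabulary_ maps token string -> column index; invert it and restore the ints.
column_to_word = {col: int(token) for token, col in vectorizer.vocabulary_.items()}

# One dict per sentence: {word (int): TF-IDF weight}.
tfidf_per_sentence = [{} for _ in docs]
coo = tfidf.tocoo()
for row, col, value in zip(coo.row, coo.col, coo.data):
    tfidf_per_sentence[int(row)][column_to_word[int(col)]] = float(value)

for sentence, weights in zip(datas, tfidf_per_sentence):
    print(sentence, weights)
```

Each row is L2-normalized by TfidfTransformer's defaults, so a one-word sentence such as [2] ends up with a weight of 1.0 for that word.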
