Kushal Shah

Reputation: 165

NumPy or Dictionary?

I have to deal with a large dataset. I need to store the term frequencies for each sentence, which I can do using either a list of dictionaries or a NumPy array.

But I will have to sort and append (in case the word already exists). Which will be better in this case?

Upvotes: 7

Views: 1581

Answers (2)

AvidLearner

Reputation: 4163

The solution to the problem you are describing is a SciPy sparse matrix.

A small example:

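from scipy.sparse import csr_matrix

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]      # indptr[i]:indptr[i+1] delimits the entries of row i
indices = []      # column index of each entry
data = []         # value of each entry
vocabulary = {}   # term -> column index
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

# Repeated (row, column) entries are summed, which yields the term counts.
print(csr_matrix((data, indices, indptr), dtype=int).toarray())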

Each sentence is a row, and each term is a column.

One more tip - check out CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 keeps only the terms that appear in at least two documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints (key order may vary):
# {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
X = vectorizer.transform(corpus)

print(X.toarray())
# prints
# [[1 1 1 1 1]
#  [1 0 1 1 1]
#  [0 0 0 1 0]
#  [1 1 1 1 1]]

Now X is your document-term matrix (note that X is a csr_matrix). You can also use TfidfTransformer if you want to apply tf-idf weighting to it.
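As a rough sketch of that last step, assuming the same corpus as above and the default TfidfTransformer settings, the counts can be reweighted like this (the variable names counts and tfidf are just illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Raw term counts, as a sparse document-term matrix.
counts = CountVectorizer(min_df=2).fit_transform(corpus)

# Reweight the counts with tf-idf; the result is still a sparse matrix.
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)
print(tfidf.toarray().round(2))
```

The output keeps the shape and sparsity pattern of the count matrix; only the stored values change.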

Upvotes: 5

ldirer

Reputation: 6756

As you mention in the comments, you don't know the size of the words/tweets matrix that you will eventually obtain, which makes an array a cumbersome solution.

It feels more natural to use a dictionary here, for the reasons you noted. The keys of the dictionary will be the words in the tweets, and the values can be lists with (tweet_id, term_frequency) elements.
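A minimal sketch of that layout, assuming the tweets are already tokenized and keyed by an integer id (the sample data and the name postings are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical sample data: tweet_id -> list of tokens.
tweets = {
    0: ["hello", "world", "hello"],
    1: ["goodbye", "cruel", "world"],
}

# word -> list of (tweet_id, term_frequency) pairs
postings = defaultdict(list)
for tweet_id, tokens in tweets.items():
    # Counter gives the frequency of each word within this tweet.
    for word, count in Counter(tokens).items():
        postings[word].append((tweet_id, count))

print(postings["hello"])  # [(0, 2)]
print(postings["world"])  # [(0, 1), (1, 1)]
```

New words and new tweets can be added at any time without knowing the final vocabulary size in advance.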

Eventually you might want to do something else (e.g. classification) with your term frequencies. I suspect this is why you want to use a numpy array from the start. It should not be too hard to convert the dictionary to a numpy array afterwards though, if that is what you wish to do.
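For instance, one possible conversion, assuming a dictionary of word -> (tweet_id, term_frequency) lists and consecutive integer tweet ids starting at 0 (the sample dictionary is hypothetical):

```python
import numpy as np

# Hypothetical dictionary: word -> list of (tweet_id, term_frequency) pairs.
postings = {
    "hello": [(0, 2)],
    "world": [(0, 1), (1, 1)],
    "goodbye": [(1, 1)],
    "cruel": [(1, 1)],
}

# Fix a column order for the words (sorted so that it is reproducible).
words = sorted(postings)
column = {word: j for j, word in enumerate(words)}
n_tweets = 1 + max(tid for pairs in postings.values() for tid, _ in pairs)

# Dense tweets-by-words matrix, filled in from the dictionary.
tf = np.zeros((n_tweets, len(words)), dtype=int)
for word, pairs in postings.items():
    for tweet_id, count in pairs:
        tf[tweet_id, column[word]] = count

print(words)  # ['cruel', 'goodbye', 'hello', 'world']
print(tf)
```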

However, note that this array is likely to be both very big (1M * number of words) and very sparse, which means it will contain mostly zeros. Because a NumPy array would use a lot of memory just to store zeros, you might want to look at a more memory-efficient data structure for storing sparse matrices (see scipy.sparse).
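To give a feel for the difference, here is a small sketch that builds a sparse matrix from (tweet, word, count) triples and compares its footprint with what a dense array of the same shape would need. The triples and the matrix dimensions are assumptions picked for illustration, not figures from your data:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical non-zero entries: (tweet_id, word_id, term_frequency).
rows = [0, 0, 1, 1, 1]
cols = [0, 1, 1, 2, 3]
vals = [2, 1, 1, 1, 1]

# Assumed sizes, chosen only for illustration.
n_tweets, n_words = 1000, 5000

# Build in COO format, then convert to CSR for fast row access.
sparse_tf = coo_matrix(
    (vals, (rows, cols)), shape=(n_tweets, n_words), dtype=np.int64
).tocsr()

# Memory a dense int64 array of the same shape would need (not allocated here).
dense_bytes = n_tweets * n_words * np.dtype(np.int64).itemsize
# Memory actually held by the three CSR arrays.
sparse_bytes = (
    sparse_tf.data.nbytes + sparse_tf.indices.nbytes + sparse_tf.indptr.nbytes
)

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The sparse footprint grows with the number of non-zero entries (plus one index per row), rather than with the full number of cells.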

Hope this helps.

Upvotes: 1
