aless80

Reputation: 3332

Python CountVectorizer: presence of term in documents

I am doing LDA analysis with Python. Is there an out-of-the-box way of getting, for each word (Edit: each term of n words), the number of texts in my corpus (a list of text strings) in which it is present?

The answer here by @titipata gives the word frequency: How to extract word frequency from document-term matrix?

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hey you', 'you ah ah ah']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
freq = np.ravel(X.sum(axis=0))

import operator
# get vocabulary keys, sorted by value
vocab = [v[0] for v in sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))]
fdist = dict(zip(vocab, freq)) # return same format as nltk

The word frequency is here:

fdist
{u'ah': 3, u'you': 2, u'hey': 1}

but I want

presence
{u'ah': 1, u'you': 2, u'hey': 1}

Edit: this should also work for terms of N words, where N is something you can define.

I can calculate what I want as below, but is there a faster way using CountVectorizer?

presence={}
for w in vocab:
    pres=0
    for t in texts:
        pres+=w in set(t.split())
    presence[w]=pres

Edit: what I just wrote for presence does not work for terms of N words. The following works, but it is slow:

from collections import Counter

counter = Counter()
for t in texts:
    for term in vectorizer.get_feature_names():
        counter.update({term: term in t})  # substring test per document
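For reference, a vectorized sketch of what I am after: CountVectorizer has a binary parameter that records presence (0/1) per document instead of raw counts, so a column sum gives the number of documents containing each term. The ngram_range=(1, 2) below is just an illustrative choice for multi-word terms:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hey you', 'you ah ah ah']

# binary=True makes each cell 0/1 (term present in document) instead of a
# count, so summing over the document axis yields document frequency.
# ngram_range=(1, 2) also indexes two-word terms, as an example.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# vocabulary_ maps term -> column index; sort by index to align with the sums
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
presence = dict(zip(vocab, np.ravel(X.sum(axis=0))))
```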

Upvotes: 0

Views: 1043

Answers (1)

Carlo Mazzaferro

Reputation: 858

If your corpus is not too large, this should work nicely and be quite fast. It also relies only on Python built-ins; see the documentation for Counter.

from collections import Counter

corpus = ['hey you', 'you ah ah ah']
sents = []

for sent in corpus:
    sents.extend(set(sent.split()))   # use a set so each word counts once per document

Counter(sents)

Returns:

Counter({'ah': 1, 'hey': 1, 'you': 2})
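The same set-based trick extends to terms of N words if you generate the n-grams yourself. A sketch, where the `ngrams` helper and the `(1, 2)` range are illustrative choices, not part of any library:

```python
from collections import Counter

corpus = ['hey you', 'you ah ah ah']

def ngrams(words, n):
    # All contiguous n-word terms in a token list
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

presence = Counter()
for sent in corpus:
    words = sent.split()
    terms = set()                  # set: count each term once per document
    for n in (1, 2):               # unigrams and bigrams, as an example
        terms.update(ngrams(words, n))
    presence.update(terms)
```

Here `presence` maps both single words and two-word terms to the number of documents containing them.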

Upvotes: 2
