Reputation: 1
I want to count how many documents a particular word appears in. For example, the word "Dog" appears in 67 out of 100 documents.
One document is equivalent to one file.
The frequency of the word "Dog" within a document therefore does not need to be counted. For example, if "Dog" appears 250 times in document 1, it still counts only once, since my goal is to count the documents, not how many times "Dog" appears in a specific document.
Example:
So the answer must be 4.
I have my own algorithm, but I believe there is a more efficient way to do this. I'm using Python 3.4 with the NLTK library. I need help. Thank you guys!
Here's my code
# DOCUMENT FREQUENCY
for eachadd in wordwithsource:
    for eachaddress in wordwithsource:
        if eachaddress == eachadd:
            if eachaddress not in copyadd:
                countofdocs=0
                copyadd.append(eachaddress)
                countofdocs = countofdocs+1
                addmanipulation.append(eachaddress[0])
for everyx in addmanipulation:
    documentfrequency = addmanipulation.count(everyx)
    if everyx not in otherfilter:
        otherfilter.append(everyx)
        documentfrequencylist.append([everyx,documentfrequency])
#COMPARE WORDS INTO DOC FREQUENCY
for everywords in tempwords:
    for everydocfreq in documentfrequencylist:
        if everywords.find(everydocfreq[0]) !=-1:
            docfreqofficial.append(everydocfreq[1])
for everydocfrequency in docfreqofficial:
    docfrequency=(math.log10(numberofdocs/everydocfrequency))
    docfreqanswer.append(docfrequency)
Upvotes: 0
Views: 3950
Reputation: 81
This can be done in gensim.
from gensim import corpora
dictionary = corpora.Dictionary(doc for doc in corpus)
dictionary.dfs
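Here doc is a list of tokens and corpus is a list of documents. dictionary.dfs maps each token id (not the token string) to the number of documents that contain the token; dictionary.token2id gives the id for a given word. The Dictionary instance also stores overall term frequencies (cfs).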
https://radimrehurek.com/gensim/corpora/dictionary.html
Upvotes: 2
Reputation: 193
You could store a frequency dictionary for each document and use another global dictionary for the document frequency of words. I have used Counter for simplicity.
from collections import Counter
# using a list to simulate a document store
documents = ['This is document %d' % i for i in range(5)]
# calculate word frequencies per document
word_frequencies = [Counter(document.split()) for document in documents]
# calculate document frequency: each document adds at most 1 per word
document_frequencies = Counter()
# a plain loop is used because map() is lazy in Python 3 and would never call update()
for word_frequency in word_frequencies:
    document_frequencies.update(word_frequency.keys())
print(document_frequencies)
# Output (the order of entries with equal counts may vary):
# Counter({'This': 5, 'is': 5, 'document': 5, '0': 1, '1': 1, '2': 1, '3': 1, '4': 1})
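To tie this back to the question, here is a minimal sketch of the same idea using a set per document, followed by the log10(N / df) step from the question's code. It assumes each document has already been tokenized into a list of words. The function names `document_frequency` and `inverse_document_frequency` and the sample documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        # set() collapses repeats, so a document adds at most 1 per word
        df.update(set(tokens))
    return df


def inverse_document_frequency(df, number_of_docs):
    """Apply log10(N / df) to every word, as in the question's code."""
    return {word: math.log10(number_of_docs / count)
            for word, count in df.items()}


# Hypothetical sample data: four of the five documents mention "dog".
docs = [
    "dog dog dog cat".split(),
    "dog bird".split(),
    "the dog barks".split(),
    "cat bird".split(),
    "dog".split(),
]

df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))
print(df["dog"])             # 4
print(round(idf["dog"], 4))  # 0.0969
```

With real files, the token lists could come from reading each file and passing its text to a tokenizer such as `nltk.word_tokenize`. Whether to lowercase first, so that "Dog" and "dog" count as the same word, is a choice this sketch leaves open.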
Upvotes: 1