Richter_Belmont

Reputation: 1

Document Frequency in Python

I want to count the number of documents in which a particular word appears. For example, the word "Dog" appears in 67 out of 100 documents.

1 document is equivalent to 1 file.

Therefore, how often the word "Dog" appears within a document does not matter. For example, even if "Dog" appears 250 times in document 1, it is still counted only once, since my goal is to count documents, not how many times "Dog" appears in a specific document.

Example:

So the answer must be 4.
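The counting described above can be sketched with sets, so repeats within one document count only once (a minimal illustration; the five sample strings below stand in for the asker's files):

```python
# Count in how many documents each word appears (document frequency).
# Tokenizing each document into a set() deduplicates words per document.
documents = [
    "Dog cat bird",
    "Dog dog dog",   # "dog" still counts once for this document
    "fish cat",
    "dog house",
    "Dog park",
]

document_frequency = {}
for text in documents:
    for word in set(text.lower().split()):  # one count per document
        document_frequency[word] = document_frequency.get(word, 0) + 1

print(document_frequency["dog"])  # -> 4
```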

I have my own algorithm, but I believe there's a more efficient way to do this. I'm using Python 3.4 with the NLTK library. I need help. Thank you, guys!

Here's my code:

import math

# DOCUMENT FREQUENCY
# wordwithsource holds (word, document) pairs; keep each pair only once
# so a word is counted at most once per document.
for eachadd in wordwithsource:
    for eachaddress in wordwithsource:
        if eachaddress == eachadd:
            if eachaddress not in copyadd:
                copyadd.append(eachaddress)
                addmanipulation.append(eachaddress[0])

# Count how many documents each word appears in.
for everyx in addmanipulation:
    documentfrequency = addmanipulation.count(everyx)
    if everyx not in otherfilter:
        otherfilter.append(everyx)
        documentfrequencylist.append([everyx, documentfrequency])

# COMPARE WORDS AGAINST DOCUMENT FREQUENCIES
for everywords in tempwords:
    for everydocfreq in documentfrequencylist:
        if everywords.find(everydocfreq[0]) != -1:
            docfreqofficial.append(everydocfreq[1])

# Inverse document frequency: log10(N / document frequency)
for everydocfrequency in docfreqofficial:
    docfrequency = math.log10(numberofdocs / everydocfrequency)
    docfreqanswer.append(docfrequency)
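The final loop above computes the inverse document frequency, idf = log10(N / df). As a self-contained illustration of just that step (the document counts below are made-up example values):

```python
import math

numberofdocs = 100
docfreqofficial = [67, 10, 1]  # example document frequencies

# A word appearing in every document gets a low idf;
# a word appearing in one document gets the highest.
docfreqanswer = [math.log10(numberofdocs / df) for df in docfreqofficial]
print(docfreqanswer)  # log10(100/1) == 2.0 for the rarest word
```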

Upvotes: 0

Views: 3950

Answers (2)

Ryan Boch

Reputation: 81

This can be done in gensim.

from gensim import corpora

dictionary = corpora.Dictionary(doc for doc in corpus)
dictionary.dfs

Here doc is a list of tokens and corpus is an iterable of such token lists. dictionary.dfs maps each token id to the number of documents it appears in; the Dictionary instance also stores overall term frequencies (cfs).

https://radimrehurek.com/gensim/corpora/dictionary.html

Upvotes: 2

gnub

Reputation: 193

You could store a frequency dictionary for each document and use another global dictionary for the document frequency of words. I have used Counter for simplicity.

from collections import Counter

#using a list to simulate document store which stores documents
documents = ['This is document %d' % i for i in range(5)]

#calculate words frequencies per document
word_frequencies = [Counter(document.split()) for document in documents]

#calculate document frequency
#(a for loop is used because map() is lazy in Python 3 and would never run)
document_frequencies = Counter()
for word_frequency in word_frequencies:
    document_frequencies.update(word_frequency.keys())

print(document_frequencies)

Counter({'This': 5, 'is': 5, 'document': 5, '0': 1, '1': 1, '2': 1, '3': 1, '4': 1})
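If you also want the log-scaled values the question computes at the end, you can derive them from these counts with the question's formula, math.log10(numberofdocs / df) (a sketch reusing the same simulated documents):

```python
import math
from collections import Counter

documents = ['This is document %d' % i for i in range(5)]

# Update with a set per document so each word counts once per document.
document_frequencies = Counter()
for document in documents:
    document_frequencies.update(set(document.split()))

idf = {word: math.log10(len(documents) / df)
       for word, df in document_frequencies.items()}
print(idf['This'])  # appears in all 5 documents -> log10(5/5) == 0.0
```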

Upvotes: 1
