Sean

Reputation: 13

Finding document frequency using Python

Hey everyone, I know this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to compute TF-IDF and then find the cosine scores between the documents and a query, but I am stuck at finding document frequency. This is what I have so far:

#imports
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
query = re.findall(r'\w+', open(y).read().lower())
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_IDF = re.findall(r'\w+', open(filename).read().lower())

        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]

        word_IDF = doc_IDF

        #pseudocode!!
        """
        for key in word_IDF:
            if key in word_IDF:
                word_IDF[key] += 1
            else:
                word_IDF[key] = 1

        print word_IDF
        """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_TF = re.findall(r'\w+', open(filename).read().lower())

        #keep only alphabetic words at least 3 characters long
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        #count each term; this is the raw TF vector for the document
        TFvec = Counter(doc_TF)

        #weight each TF with a log function
        for key in TFvec: 
            TFvec[key] = 1 + math.log10(TFvec[key])


    #placed here (prints only the last file's TF) so I don't get a command line full of text
    print TFvec 

#Error checker
else:
    print "That path does not exist"

I am using Python 2, and so far I don't have any idea how to count how many documents a term appears in. I can find the total number of documents, but counting the number of documents each term appears in is where I am stuck. My plan was to build one large dictionary holding the terms from all of the documents, so they could be fetched later when a query needs them. Thank you for any help you can give me.

Upvotes: 1

Views: 8283

Answers (1)

Mike Koltsov

Reputation: 336

The DF of a term x is the number of documents in which x appears. To find it, you need to iterate over all of the documents first; only then can you compute IDF from DF.

You can use a dictionary for counting DF:

  1. Iterate over all of the documents.
  2. For each document, retrieve the set of its words (without repetitions).
  3. Increase the DF count of every word from step 2. Because you use a set, each count goes up by exactly one, no matter how many times the word occurred in the document.

Python code could look like this (it reuses path and doccounter from your script):

import glob
import math
import os
import re
from collections import defaultdict

DF = defaultdict(int) 
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict replaces your "if key in word_IDF" bookkeeping

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # float() avoids Python 2 integer division
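
Once DF and IDF are in place, the cosine scoring you mentioned is not much more code. Here is a minimal sketch, assuming each document's raw Counter of terms as input (i.e., before your in-place log weighting); the helper names tfidf_vector and cosine are made up for illustration, not a standard API:

def tfidf_vector(tf_counts, IDF):
    # Weight each raw count: (1 + log10(tf)) * idf.
    vec = {}
    for word in tf_counts:
        if word in IDF:
            vec[word] = (1 + math.log10(tf_counts[word])) * IDF[word]
    return vec

def cosine(vec_a, vec_b):
    # Dot product over shared terms, divided by the product of the norms.
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

Build one vector per document, one from your Query_vec, and rank the documents by cosine(doc_vec, query_vec).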

PS It's good for learning to implement things manually, but if you ever get stuck, I suggest looking at the NLTK package. It provides useful functions for working with corpora (collections of texts).
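
For instance, nltk.TextCollection exposes tf, idf and tf_idf directly (a sketch from memory; double-check the exact API against the NLTK documentation for your version):

from nltk.text import TextCollection

# Each "text" is simply a list of tokens; two toy documents here.
docs = [['the', 'cat', 'sat'], ['the', 'dog', 'sat', 'down']]
corpus = TextCollection(docs)

print corpus.idf('cat')              # idf of 'cat' across the collection
print corpus.tf_idf('cat', docs[0])  # tf-idf of 'cat' in the first document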

Upvotes: 6
