How to create term frequency matrix for multiple text files?

Question

I have four text files, test1.txt, test2.txt, test3.txt and test4.txt, each containing a few different keywords. I want to turn their keyword counts into a matrix (a list of lists). This is my code so far:

temp = [''] + list(sample_collection)
values = list(sample_collection['test1.txt'])

sample_collection = [temp] + [[x] + [v.get(x, 0) for v in sample_collection.values()] for x in values]

However, I want it to include not only the keywords from test1.txt but also every unique keyword from the other files. I have no idea how to do that. Can this piece of code be adapted to do it?

Expected output:

[['', 'test1.txt', 'test2.txt', 'test3.txt', 'test4.txt'],
['apple', 1, 0, 2, 1],
['banana', 1, 1, 1, 1],
['lemon', 1, 1, 0, 0],
['grape', 0, 0, 0, 1]]

Green · Accepted Answer

I would use the scikit-learn (sklearn) library.

It isn't part of the Python standard library, so you will need to install it (pip install scikit-learn).

Then import CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

Read your files and store their contents in a list, say my_corpus, so that my_corpus has four members, one string per file.

Then just use:

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(my_corpus)

Alternatively, if you would rather not use other packages, just do:

corpus = ["I like dogs", "I like cats", "cats like milk", "You likes me"]
token_corpus = [s.split() for s in corpus]

# Map each token to a list holding its count in every document
vocabulary = {}
for i, f in enumerate(token_corpus):
    for t in f:
        if t not in vocabulary:
            # First time this token is seen: start with zero for every document
            vocabulary[t] = [0] * len(corpus)
        vocabulary[t][i] += 1

vocabulary
{'I': [1, 1, 0, 0], 'like': [1, 1, 1, 0], 'dogs': [1, 0, 0, 0], 'cats': [0, 1, 1, 0], 'milk': [0, 0, 1, 0], 'You': [0, 0, 0, 1], 'likes': [0, 0, 0, 1], 'me': [0, 0, 0, 1]}

If you want to save it as a list, just use:

list(map(list, vocabulary.items()))
[['I', [1, 1, 0, 0]], ['like', [1, 1, 1, 0]], ['dogs', [1, 0, 0, 0]], ['cats', [0, 1, 1, 0]], ['milk', [0, 0, 1, 0]], ['You', [0, 0, 0, 1]], ['likes', [0, 0, 0, 1]], ['me', [0, 0, 0, 1]]]
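The list above nests each term's counts one level deeper than the layout asked for in the question. As a rough sketch, the question's own comprehension can be kept almost unchanged by collecting the keywords from every file instead of only test1.txt. This assumes sample_collection is a dict mapping each file name to a dict of keyword counts, which the question implies but does not show; the values below are made up to match the expected output.

```python
# Assumed shape of sample_collection: {file name: {keyword: count}}
sample_collection = {
    "test1.txt": {"apple": 1, "banana": 1, "lemon": 1},
    "test2.txt": {"banana": 1, "lemon": 1},
    "test3.txt": {"apple": 2, "banana": 1},
    "test4.txt": {"apple": 1, "banana": 1, "grape": 1},
}

# Every unique keyword across all files, in order of first appearance
keywords = list(dict.fromkeys(
    word for counts in sample_collection.values() for word in counts
))

# Header row of file names, then one row per keyword
header = [""] + list(sample_collection)
rows = [
    [word] + [counts.get(word, 0) for counts in sample_collection.values()]
    for word in keywords
]
term_matrix = [header] + rows

for line in term_matrix:
    print(line)
```

dict.fromkeys is used here only to drop duplicates while keeping first-seen order; a set would also work if the row order does not matter.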
