Reputation: 61
I have a total of four text files, each containing a few different keywords. They are called test1.txt, test2.txt, test3.txt and test4.txt. I want to transform them into a matrix/list of lists, and I have the following code:
temp = [''] + list(sample_collection)
values = list(sample_collection['test1.txt'])
sample_collection = [temp] + [[x] + [v.get(x, 0) for v in sample_collection.values()] for x in values]
However, I want to modify it to include not only the keywords from test1, but all other unique keywords from the other files. I have no clue how to do so. Is there a way to do so with that piece of code?
Expected output:
[['', 'test1.txt', 'test2.txt', 'test3.txt', 'test4.txt'],
['apple', 1, 0, 2, 1],
['banana', 1, 1, 1, 1],
['lemon', 1, 1, 0, 0],
['grape', 0, 0, 0, 1]]
Upvotes: 0
Views: 361
Reputation: 2565
I would use the sklearn framework.
It isn't part of the Python standard library, so you will need to install it (pip install scikit-learn).
Then import CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
Read your files and store their contents in a list, say my_corpus. You now have a list named my_corpus with 4 members, one string per file.
Then just use:
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(my_corpus)
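The file-reading step is not shown above, so here is a minimal sketch of it. It assumes the four files are plain UTF-8 text in the current directory; `read_corpus` is a made-up helper name for illustration, not something sklearn provides.

```python
from pathlib import Path


def read_corpus(paths):
    """Return the contents of each file as one string per file."""
    # Assumption: the files are UTF-8 encoded plain text.
    return [Path(p).read_text(encoding="utf-8") for p in paths]


# File names taken from the question; adjust the paths as needed.
file_names = ["test1.txt", "test2.txt", "test3.txt", "test4.txt"]
# my_corpus = read_corpus(file_names)  # one string per file, same order as file_names
```

The resulting list keeps the order of file_names, so row i of the matrix corresponds to file_names[i].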
Alternatively, if you would rather not use other packages, just do:
corpus = ["I like dogs", "I like cats", "cats like milk", "You likes me"]
token_corpus = [s.split() for s in corpus]
vocabulary = {}
for i, f in enumerate(token_corpus):
    for t in f:
        if t not in vocabulary:
            vocabulary[t] = [0]*len(corpus)
        vocabulary[t][i]+=1
vocabulary
{'I': [1, 1, 0, 0], 'like': [1, 1, 1, 0], 'dogs': [1, 0, 0, 0], 'cats': [0, 1, 1, 0], 'milk': [0, 0, 1, 0], 'You': [0, 0, 0, 1], 'likes': [0, 0, 0, 1], 'me': [0, 0, 0, 1]}
If you want to save it as a list, just use:
list(map(list, vocabulary.items()))
[['I', [1, 1, 0, 0]], ['like', [1, 1, 1, 0]], ['dogs', [1, 0, 0, 0]], ['cats', [0, 1, 1, 0]], ['milk', [0, 0, 1, 0]], ['You', [0, 0, 0, 1]], ['likes', [0, 0, 0, 1]], ['me', [0, 0, 0, 1]]]
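The list above nests the counts inside each entry. To get the flat layout from the question (a header row of file names, then one row per keyword), the same dict can be unpacked slightly differently. This is only a sketch: the `to_table` helper and the sample file contents are made up for illustration, and `names` is assumed to hold one label per document in the same order as `corpus`.

```python
def to_table(vocabulary, names):
    """Build [['', name1, ...], [keyword, count1, ...], ...] from a keyword -> counts dict."""
    header = [""] + list(names)
    rows = [[word] + counts for word, counts in vocabulary.items()]
    return [header] + rows


# Hypothetical file contents standing in for test1.txt ... test4.txt
names = ["test1.txt", "test2.txt", "test3.txt", "test4.txt"]
corpus = ["apple banana lemon", "banana lemon", "apple apple banana", "apple banana grape"]

# Same counting loop as above: one counter list per keyword, one slot per file
vocabulary = {}
for i, text in enumerate(corpus):
    for token in text.split():
        if token not in vocabulary:
            vocabulary[token] = [0] * len(corpus)
        vocabulary[token][i] += 1

table = to_table(vocabulary, names)
```

Because the dict collects keywords from every document, the rows cover all unique keywords, not only those in the first file.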
Upvotes: 2