Reputation: 111
Here is my code:
import os

def corpus_reading_pos(corpus_name, pos_tag, option="pos"):
    pos_tags = []
    words = []
    tokens_pos = {}
    file_count = 0
    for root, dirs, files in os.walk(corpus_name):
        for file in files:
            if file.endswith(".v4_gold_conll"):
                with open((os.path.join(root, file))) as f:
                    pos_tags += [line.split()[4] for line in f if line.strip() and not line.startswith("#")]
                with open((os.path.join(root, file))) as g:
                    words += [line.split()[3] for line in g if line.strip() and not line.startswith("#")]
                file_count += 1
    for pos in pos_tags:
        tokens_pos[pos] = []
    words_pos = list(zip(words, pos_tags))
    for word in words_pos:
        tokens_pos[word[1]] = word[0]
    #print(words_pos)
    print(tokens_pos)
    #print(words)
    print("Token count:", len(tokens_pos))
    print("File count:", file_count)
I'm trying to create a dictionary whose keys are the POS tags and whose values are lists of all the words that belong to each tag. I'm stuck on the part where each value has to be a list of words; I can't seem to get there.
In the code, the line tokens_pos[word[1]] = word[0] stores only one word per key, but if I try something like [].append(word[0]), every value in the dictionary comes back as None.
Upvotes: 2
Views: 89
Reputation: 1975
You seem to be doing a lot of duplicated work, but to answer your specific question:
for word in words_pos:
    tokens_pos[word[1]].append(word[0])
should do what you want, since your earlier loop already initialises every key to an empty list.
With
tokens_pos[word[1]] = word[0]
you are overwriting any existing value stored under the same key, so only the last word written for each key remains at the end.
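As a rough sketch of the difference, here is the append approach on a few made-up (word, pos) pairs. The sample data and the helper name `group_by_pos` are my own assumptions for illustration, not part of your code, and I have swapped in `collections.defaultdict` so the separate key-initialising loop is not needed. The last two lines show why your `[].append(...)` attempt gave None: `list.append` changes the list in place and returns None.

```python
from collections import defaultdict

# Made-up (word, pos) pairs standing in for words_pos from the question.
words_pos = [("the", "DT"), ("cat", "NN"), ("a", "DT"), ("dog", "NN"), ("runs", "VBZ")]


def group_by_pos(pairs):
    """Collect every word under its POS tag (hypothetical helper)."""
    tokens_pos = defaultdict(list)  # a missing key starts out as an empty list
    for word, pos in pairs:
        tokens_pos[pos].append(word)  # append mutates the existing list in place
    return dict(tokens_pos)


grouped = group_by_pos(words_pos)
print(grouped)

# list.append returns None, so storing its return value stores None.
result = [].append("cat")
print(result)
```

The same unpacking idea probably also removes the duplicated work: each line can be split once and fields 3 and 4 taken together, instead of opening every file twice.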
Upvotes: 3