Reputation: 385
I have a corpus that is a list of tuples, each containing a word and a POS tag. Given every word and every POS tag that exists in the corpus, I want to iterate through the corpus and record the number of times each word and tag combination occurs. If a combination does not occur in the corpus, its count should be 0.
possible_tags = ['Verb','Noun','Det']
possible_words = ['Merger', 'proposed', 'Wards', 'protected', 'A']
corpus = [('Merger', 'Noun'), ('proposed', 'Verb'), ('Wards', 'Noun'), ('protected', 'Verb'), ('A', 'Det'), ('Merger','Noun')]
output = {'Merger_Noun':2, 'Merger_Verb':0, 'Merger_Det':0, 'proposed_Noun':0, 'proposed_Verb':1, 'proposed_Det':0, ....... }
Upvotes: 1
Views: 321
Reputation: 462
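Try using a dictionary: initialize a key for every word and tag combination with a count of 0, then iterate through the corpus and increment the count for each pair you see.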
possible_tags = ['Verb','Noun','Det']
possible_words = ['Merger', 'proposed', 'Wards', 'protected', 'A']
corpus = [('Merger', 'Noun'), ('proposed', 'Verb'), ('Wards', 'Noun'), ('protected', 'Verb'), ('A', 'Det'), ('Merger','Noun')]
# Initialize output to an empty dictionary
output = {}
# Set the count of every word/tag combination to 0
for each_word in possible_words:
    for each_tag in possible_tags:
        key = each_word + "_" + each_tag
        output[key] = 0
# Iterate through the corpus
for each in corpus:
    # Each tuple is (word, tag); build the same key and increment its count
    # (raises KeyError if the pair is not in possible_words/possible_tags)
    output[each[0] + "_" + each[1]] += 1
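As an alternative to the two explicit loops, here is a minimal sketch that uses `collections.Counter` and `itertools.product` from the standard library. This swaps in a different technique, not a correction of the approach above. The name `pair_counts` is only illustrative. It relies on `Counter` returning 0 for keys it has never seen, so the zero counts need no separate initialization step.

```python
from collections import Counter
from itertools import product

possible_tags = ['Verb', 'Noun', 'Det']
possible_words = ['Merger', 'proposed', 'Wards', 'protected', 'A']
corpus = [('Merger', 'Noun'), ('proposed', 'Verb'), ('Wards', 'Noun'),
          ('protected', 'Verb'), ('A', 'Det'), ('Merger', 'Noun')]

# Count each (word, tag) tuple that actually occurs in the corpus.
pair_counts = Counter(corpus)

# Build the full table over every word/tag combination.
# A Counter returns 0 for a missing key, which covers unseen pairs.
output = {
    f"{word}_{tag}": pair_counts[(word, tag)]
    for word, tag in product(possible_words, possible_tags)
}

print(output['Merger_Noun'])  # 2
print(output['Merger_Verb'])  # 0
```

One difference in behavior: a pair in the corpus whose word or tag is missing from `possible_words` or `possible_tags` is left out of `output` here, whereas the loop version would raise a `KeyError` on it.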
Upvotes: 1