Reputation: 1582
So I am taking a natural language processing class and I need to create a trigram language model to generate random text that looks "realistic" to a certain degree based off of some sample data.
Essentially, I need to create a "trigram" structure to hold the various three-word combinations. My professor hinted that this can be done with a dictionary of dictionaries of dictionaries, which I attempted to create using:
trigram = defaultdict( defaultdict(defaultdict(int)))
However I get an error that says:
trigram = defaultdict( dict(dict(int)))
TypeError: 'type' object is not iterable
How would I go about creating a three-layer nested dictionary, that is, a dictionary of dictionaries of dictionaries of int values?
I guess people vote down a question on Stack Overflow if they don't know how to answer it. I'll add some background to better explain the question for those willing to help.
This trigram is used to keep track of three-word patterns. They are used in text-processing software and almost everywhere throughout natural language processing (think Siri or Google Now).
If we designate the three levels of dictionaries as dict1, dict2, and dict3, then parsing a text file and reading the statement "The boy runs" would produce the following:
dict1 has a key of "the". Accessing that key returns dict2, which contains the key "boy". Accessing that key returns the final dict3, which contains the key "runs". Accessing that key returns the value 1.
This means that "the boy runs" has appeared once in this text. If we encounter it again, we follow the same process and increment the 1 to 2. If we encounter "the girl walks", then dict2 (the dictionary stored under the "the" key) will now contain another key for "girl", which holds a dict3 with a key of "walks" and a value of 1, and so forth. Eventually, after parsing a ton of text (and keeping track of the word counts), you will have a trigram that can estimate how likely a certain starting word is to lead to a given three-word combination, based on how often each combination appeared in the previously parsed text.
This can help you create grammar rules to identify languages or, in my case, create randomly generated text that looks very much like grammatical English. I need a three-layer dictionary because at any position of a three-word combination there can be another word that creates a whole different set of combinations. I tried to explain trigrams and their purpose to the best of my ability... granted, I just started the class a couple of weeks ago.
Now, with all of that being said: how would I go about creating a dictionary of dictionaries of dictionaries whose innermost dictionary holds values of type int in Python?
trigram = defaultdict( defaultdict(defaultdict(int)))
throws an error for me
Upvotes: 7
Views: 9311
Reputation: 1414
The defaultdict __init__ method takes an argument that is required to be a callable. The callable passed to defaultdict must be callable with no arguments, and must return an instance of the default value.
The problem with nesting defaultdict as you did is that the inner defaultdict(...) call is evaluated immediately and returns an instance. That means that rather than the wrapping defaultdict having a callable as its __init__ argument, it has an instance of defaultdict, which is not callable.
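To make that concrete, here is a minimal sketch (the variable names are only illustrative) that reproduces the failure with two levels of nesting and checks that a defaultdict instance is indeed not callable. Note that the traceback quoted in the question shows dict(dict(int)), which is a different expression and fails with a different message.

```python
from collections import defaultdict

# A defaultdict *instance* has no __call__, so it cannot serve as a factory.
inner = defaultdict(int)
inner_is_callable = callable(inner)
print(inner_is_callable)

error_message = None
try:
    # The inner call runs first and yields an instance, which is then
    # handed to the outer defaultdict where a callable is expected.
    trigram = defaultdict(defaultdict(int))
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

By contrast, defaultdict(int) works because int itself is callable with no arguments and returns 0.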
The lambda solution by @pcoving will work, because it creates an anonymous function which returns a defaultdict initialized with a function that returns the correct type of defaultdict for each layer in the dictionary nesting.
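If the lambdas feel opaque, the same "callable that builds the next layer" idea can be spelled with functools.partial. This is a swapped-in alternative for illustration, not what the answer above proposes, and the name make_trigram is hypothetical.

```python
from collections import defaultdict
from functools import partial

def make_trigram():
    """Build a three-level counter without lambdas (illustrative helper)."""
    # partial(defaultdict, int) is a zero-argument callable that
    # returns a fresh defaultdict(int) each time it is called.
    innermost_factory = partial(defaultdict, int)
    middle_factory = partial(defaultdict, innermost_factory)
    return defaultdict(middle_factory)

trigram = make_trigram()
trigram["the"]["boy"]["runs"] += 1
trigram["the"]["boy"]["runs"] += 1
print(trigram["the"]["boy"]["runs"])
```

One practical difference is that lambdas cannot be pickled by the standard pickle module, so a partial-based factory may be easier to save to disk.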
Upvotes: 1
Reputation: 2788
I've tried nested defaultdicts before, and the solution seems to be a lambda call:
trigram = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
trigram['a']['b']['c'] += 1
It's not pretty, but I suspect the nested dictionary suggestion is for efficient lookup.
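As a rough sketch of how this structure could be used for the task in the question, the following counts three-word sequences and then picks a following word in proportion to its count. The function names build_trigram_counts and next_word are made up for this example, and the sample sentence is invented.

```python
import random
from collections import defaultdict

def build_trigram_counts(words):
    """Count every consecutive three-word sequence in `words`."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[w1][w2][w3] += 1
    return counts

def next_word(counts, w1, w2, rng=random):
    """Pick a follower of (w1, w2), weighted by how often it was seen."""
    # .get() avoids creating empty entries for unseen words.
    followers = counts.get(w1, {}).get(w2, {})
    if not followers:
        return None
    candidates = list(followers)
    weights = [followers[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

words = "the boy runs and the boy runs and the girl walks".split()
counts = build_trigram_counts(words)
print(counts["the"]["boy"]["runs"])      # 2
print(counts["the"]["girl"]["walks"])    # 1
print(next_word(counts, "the", "girl"))  # walks (the only follower seen)
```

Generating text would then amount to calling next_word repeatedly, each time sliding the two-word window forward by one.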
Upvotes: 13
Reputation: 122042
If you just need to extract and retrieve trigrams, you should try this with NLTK:
>>> import nltk
>>> sent = "this is a foo bar crazycoder"
>>> # list() lets us iterate over the trigrams more than once
>>> trigrams = list(nltk.ngrams(sent.split(), 3))
>>> trigrams
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar'), ('foo', 'bar', 'crazycoder')]
# token "a" in first element of trigram
>>> [i for i in trigrams if i[0] == "a"]
[('a', 'foo', 'bar')]
# token "a" in 2nd element of trigram
>>> [i for i in trigrams if i[1] == "a"]
[('is', 'a', 'foo')]
# token "a" in third element of trigram
>>> [i for i in trigrams if i[2] == "a"]
[('this', 'is', 'a')]
# look for 2gram in trigrams
>>> [i for i in trigrams if "foo" in i and "bar" in i]
[('a', 'foo', 'bar'), ('foo', 'bar', 'crazycoder')]
# look for a perfect 3gram (compare tuple to tuple, not list to tuple)
>>> [i for i in trigrams if tuple("foo bar crazycoder".split()) == i]
[('foo', 'bar', 'crazycoder')]
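If NLTK is not available, the same trigram tuples can be produced with the standard library alone, and collections.Counter gives the frequency counts the question asks about. This is a sketch of an alternative, not part of NLTK; trigrams_of is a hypothetical helper name.

```python
from collections import Counter

def trigrams_of(tokens):
    """Return all consecutive three-token tuples from a list of tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

sent = "this is a foo bar crazycoder"
trigrams = trigrams_of(sent.split())

# A flat Counter keyed by the whole tuple, instead of three nested levels.
trigram_counts = Counter(trigrams)
print(trigram_counts[("a", "foo", "bar")])  # 1
print(trigram_counts[("foo", "a", "bar")])  # 0 (never seen)
```

The trade-off is that a flat Counter makes "how often did this exact trigram occur" easy, while the nested layout makes "which words follow this pair" easier.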
Upvotes: 0
Reputation: 63727
Generally, the already posted solutions should work for creating a nested dictionary of trigrams. If you would like a more generalized solution, you can do one of the following: one is adapted from Perl's autovivification, and the other uses collections.defaultdict.
Solution 1:
class ngram(dict):
    """Based on perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return super(ngram, self).__getitem__(item)
        except KeyError:
            value = self[item] = type(self)()
            return value
Solution 2:
from collections import defaultdict
class ngram(defaultdict):
    def __init__(self):
        super(ngram, self).__init__(ngram)
Demo using Solution 1
>>> trigram = ngram()
>>> trigram['two']['three']['four'] = 4
>>> trigram
{'two': {'three': {'four': 4}}}
>>> trigram['two']
{'three': {'four': 4}}
>>> trigram['two']['three']
{'four': 4}
>>> trigram['two']['three']['four']
4
Demo using Solution 2
>>> a = ngram()
>>> a['two']['three']['four'] = 4
>>> a
defaultdict(<class '__main__.ngram'>, {'two': defaultdict(<class '__main__.ngram'>, {'three': defaultdict(<class '__main__.ngram'>, {'four': 4})})})
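One thing to watch with either solution is that a missing leaf autovivifies into another ngram rather than 0, so a bare += 1 on an unseen key would fail. A possible way to count with Solution 2 is sketched below; add_ngram is a hypothetical helper, and it assumes every n-gram added to one tree has the same length.

```python
from collections import defaultdict

class ngram(defaultdict):
    def __init__(self):
        super(ngram, self).__init__(ngram)

def add_ngram(tree, words):
    """Increment the count stored at the end of the path given by `words`."""
    node = tree
    for word in words[:-1]:
        node = node[word]  # autovivifies the intermediate levels
    last = words[-1]
    # .get() does not trigger the default factory, so a missing leaf reads as 0.
    node[last] = node.get(last, 0) + 1

tree = ngram()
tokens = "the boy runs the boy runs".split()
for i in range(len(tokens) - 2):
    add_ngram(tree, tokens[i:i + 3])

print(tree["the"]["boy"]["runs"])  # 2
```

Because the depth is decided by the length of the list passed in, the same helper would work for bigrams or 4-grams, as long as a single tree is not mixed across lengths.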
Upvotes: 6