Reputation: 343
I am trying to understand a Huffman code written in python by 'Rosetta code'. Following is small part from the code.
def encode(symb2freq):
heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()] #What does this do?
I'm assuming variable heap
is a list. But what is wt
and sym
?
Upvotes: 0
Views: 998
Reputation: 9796
sym2freq
is a dictionary with keys some symbols and their values the symbol's frequency. For example, if you have the string 'aaabacba', the dictionary will look like this
sym2freq = {'a': 5, 'b': 2, 'c': 1}
That's because we have 5 times the letter a, 2 times the letter b and 1 time the letter c.
Dictionaries have the method items()
, which will return a tuple of each key with its respective value. In our case, it would return
>>> sym2freq.items()
(('a', 5), ('b', 2), ('c', 1))
The for sym, wt in symb2freq.items()
part of the comprehension list is just unpacking. Every time we fetch one of our tuples from above, we assign the first object to the variable sym
and the second to the variable wt
. The names sym and wt were chosen purely to be reflective of the values the represent, i.e. symbol and weight (frequency).
>>> sym, wt = ('a', 5)
>>> print sym
'a'
>>> print wt
5
And since the list comprehension will create lists of the structure [wt, [sym, ""]]
, you'll end up with the list
>>> encode(sym2freq)
[[5, ['a', '']], [2, ['b', '']], [1, ['c', '']]]
The reason why we go from a dictionary of symbol frequencies to a structure like the list heap
is so we can sort out symbols in terms of their frequencies, which, as you are learning, is part of constructing the Huffman tree.
Upvotes: 4
Reputation: 3459
This is an interesting problem. Imagine a dictionary with five letters where the key is the letter and the value is the frequency:
a = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
then mapping it so that it's a list of lists with frequency, and a list starting with the letter into the heap variable like this:
[[1, ['a', '']], [2, ['b', '']], [3, ['c', '']], [4, ['d', '']], [5, ['e', '']]]
then when you heapify it, you get:
[[1, ['a', '']],
[3, ['c', '']],
[2, ['b', '']],
[5, ['e', '']],
[4, ['d', '']]]
Just work things out step by step with a valid dictionary to encode like the one given on the wikipedia page and you'll be able to see what each step does. Or use a trivial example to start like the one I gave and slowly grow the number of elements until you're working with a real document.
What the rosetta code is doing is iterating through the heap and adding digits to the blank string you see after the letter based on frequency, until each letter is mapped to a binary interpretation. So in the end you'd having something like this:
[['e', '101'],
['d', '010'],
['c', '1001'],
['b', '1100'],
['a', '1101']]
where the most frequent letters require the fewest bits.
Upvotes: -1
Reputation: 10647
This is a list comprehension. This is saying
symb2freq
and start to loop over them.symb2freq
and unpack them into variables sym
and wt
.[wt, [sym, ""]]
to the listheap
For example, [bar(x) for x in foo]
makes a list that applies bar(x)
to each value in the list.
Upvotes: 3
Reputation: 647
sym is symbol, and wt is weight.
https://en.wikipedia.org/wiki/Huffman_coding
Upvotes: -2