Reputation: 11
I'm building a Huffman compressor/decompressor and I'm having an issue with how the symbols, frequencies, and their corresponding codes are printed. I wrote a NodeCompare function for the priority queue that first compares frequencies, then symbols, and as a last resort the nodeIDs (giving priority to the newly created node). However, my output is not what is expected, and I can't figure out where the issue in my code is. Please help.
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <queue>
using namespace std;
struct Node
{
char symbol;
int frequency;
Node *left;
Node *right;
int nodeID;
Node(char s, int f) : symbol(s), frequency(f), left(nullptr), right(nullptr) {}
};
struct NodeCompare
{
bool operator()(Node *n1, Node *n2)
{
if (n1->frequency == n2->frequency) {
if (n1->symbol == n2->symbol) {
return n1->nodeID > n2->nodeID;
}
return n1->symbol > n2->symbol;
}
else return n1->frequency > n2->frequency;
}
};
void buildHuffmanTree(priority_queue<Node *, vector<Node *>, NodeCompare> &pq)
{
static int nodeCounter = 0;
while (pq.size() > 1)
{
Node *left = pq.top();
pq.pop();
Node *right = pq.top();
pq.pop();
Node *parent = new Node('$', left->frequency + right->frequency);
parent->nodeID = ++nodeCounter;
parent->left = left;
parent->right = right;
pq.push(parent);
}
}
void printCodes(Node *root, string code)
{
if (root == nullptr)
{
return;
}
if (root->symbol != '$')
{
cout << "Symbol: " << root->symbol << ", Frequency: " << root->frequency << ", Code: " << code << endl;
}
printCodes(root->left, code + "0");
printCodes(root->right, code + "1");
}
int main() {
string input_file = "input_file.txt";
//cin >> input_file;
ifstream infile(input_file);
vector<Node *> symbols;
char symbol;
int frequency;
priority_queue<Node *, vector<Node *>, NodeCompare> pq;
string inputline;
while (getline(infile, inputline)){
istringstream ss(inputline);
ss >> std::noskipws >> symbol;
ss >> std::skipws >> frequency;
Node *node = new Node(symbol, frequency);
symbols.push_back(node);
pq.push(node);
}
buildHuffmanTree(pq);
Node *root = pq.top();
printCodes(root, "");
}
actual output:
Symbol: 0, Frequency: 2, Code: 000
Symbol: 2, Frequency: 2, Code: 001
Symbol: C, Frequency: 2, Code: 010
Symbol: S, Frequency: 2, Code: 011
Symbol: , Frequency: 3, Code: 100
Symbol: R, Frequency: 1, Code: 1010
Symbol: 6, Frequency: 1, Code: 10110
Symbol: G, Frequency: 1, Code: 10111
Symbol: 3, Frequency: 3, Code: 110
Symbol: I, Frequency: 1, Code: 11100
Symbol: N, Frequency: 1, Code: 11101
Symbol: O, Frequency: 1, Code: 11110
Symbol: P, Frequency: 1, Code: 11111
expected output:
Symbol: 0, Frequency: 2, Code: 000
Symbol: 2, Frequency: 2, Code: 001
Symbol: I, Frequency: 1, Code: 0100
Symbol: N, Frequency: 1, Code: 0101
Symbol: 6, Frequency: 1, Code: 0110
Symbol: G, Frequency: 1, Code: 0111
Symbol: R, Frequency: 1, Code: 1000
Symbol: O, Frequency: 1, Code: 10010
Symbol: P, Frequency: 1, Code: 10011
Symbol: , Frequency: 3, Code: 101
Symbol: 3, Frequency: 3, Code: 110
Symbol: C, Frequency: 2, Code: 1110
Symbol: S, Frequency: 2, Code: 1111
Upvotes: 1
Views: 157
Reputation: 112502
There is nothing wrong with your actual output. Both the actual output and expected output are optimal Huffman codes. Both code the shown frequencies in a total of 76 bits, which is the sum over the symbols of the frequency times the code length. Because this set of frequencies results in ties when deciding on which of the lowest frequencies to combine next (in fact, many ties), there are different, equally valid choices at those steps that result in different trees and different codes. No matter what choices are made, an optimal code is generated that codes the input in a total of 76 bits, every time.
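To make the 76-bit claim concrete, here is a small sketch that sums frequency times code length over both code tables from the question. The names CodeTable, totalBits, actualCodes and expectedCodes are made up for this illustration; the frequencies and codes are copied from the two outputs shown in the question.

```cpp
#include <string>
#include <utility>
#include <vector>

// One entry per symbol: its frequency and the code assigned to it.
using CodeTable = std::vector<std::pair<int, std::string>>;

// Total encoded size in bits: sum over symbols of frequency * code length.
int totalBits(const CodeTable &table) {
    int bits = 0;
    for (const auto &entry : table)
        bits += entry.first * static_cast<int>(entry.second.size());
    return bits;
}

// Frequencies and codes copied from the question's "actual output".
CodeTable actualCodes() {
    return {{2, "000"},   {2, "001"},   {2, "010"},   {2, "011"},
            {3, "100"},   {1, "1010"},  {1, "10110"}, {1, "10111"},
            {3, "110"},   {1, "11100"}, {1, "11101"}, {1, "11110"},
            {1, "11111"}};
}

// Frequencies and codes copied from the question's "expected output".
CodeTable expectedCodes() {
    return {{2, "000"},  {2, "001"},   {1, "0100"},  {1, "0101"},
            {1, "0110"}, {1, "0111"},  {1, "1000"},  {1, "10010"},
            {1, "10011"}, {3, "101"},  {3, "110"},   {2, "1110"},
            {2, "1111"}};
}
```

Both totalBits(actualCodes()) and totalBits(expectedCodes()) come out to 76, even though the individual code lengths differ between the two tables.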
For that particular set of frequencies, seven 1's, four 2's, and two 3's, there are so many ties that there are 16 topologically distinct trees that can arise, depending on how the choices are made. Here are all of those trees:
What's more, for each of those trees, each branch can independently be made 0 on the left and 1 on the right, or 1 on the left and 0 on the right. Every topology has 12 branches (one less than the number of symbols), and so there are 4096 possible assignments of 0's and 1's to the codes for each topology.
In total then, I can generate 65,536 different Huffman codes for that set of frequencies, all of which are valid and all of which are equally optimal. You are showing two of those. There is nothing special about either of them.
There is something special about the last two trees shown above, in that they have a depth (longest code length) of four, whereas all of the other trees have depth five. Some might prefer a minimum depth tree, which neither of the codes you showed is. You can resolve ties to get a minimum depth tree by combining the minimum depth leaves and/or subtrees of the same frequency. (Leaves have depth zero.)
In order to do that, you would keep track of the depth and sort on it when the frequencies are equal. To do that, I would make symbol an int instead of char, and store the depths of subtrees as negative numbers in symbol. I.e. -1 for depth one, -2 for depth two, and so on. (Make sure your symbols are always non-negative, so bytes must be 0..255, not -128..127.) Then when the frequencies are equal, prefer the maximum values of symbol.
In struct Node, have:
int symbol;
and add to struct Node:
int depth() {
return left == nullptr ? 0 : -symbol;
}
The new NodeCompare is:
struct NodeCompare
{
bool operator()(Node *n1, Node *n2)
{
if (n1->frequency != n2->frequency)
return n1->frequency > n2->frequency;
else
return n1->symbol < n2->symbol;
}
};
The depth is updated in buildHuffmanTree() using:
Node *parent = new Node(-(max(left->depth(), right->depth()) + 1),
left->frequency + right->frequency);
For showing the tree, you didn't need the $ business anyway. Just check one of the pointers:
if (root->left == nullptr)
{
cout << "Symbol: " << (char)(root->symbol) << ", Frequency: " << root->frequency << ", Code: " << code << endl;
}
Now it will construct a minimum depth, but still optimal tree, which in this case is depth four:
Symbol: 0, Frequency: 2, Code: 000
Symbol: O, Frequency: 1, Code: 0010
Symbol: N, Frequency: 1, Code: 0011
Symbol: R, Frequency: 1, Code: 0100
Symbol: P, Frequency: 1, Code: 0101
Symbol: I, Frequency: 1, Code: 0110
Symbol: G, Frequency: 1, Code: 0111
Symbol: 3, Frequency: 3, Code: 100
Symbol: , Frequency: 3, Code: 101
Symbol: 6, Frequency: 1, Code: 1100
Symbol: S, Frequency: 2, Code: 1101
Symbol: C, Frequency: 2, Code: 1110
Symbol: 2, Frequency: 2, Code: 1111
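For reference, the pieces above can be put together into one self-contained sketch. It is only one possible arrangement, not a drop-in replacement for the question's program: the function names minDepthCodes, collect, longestCode and questionFrequencies are invented here, the frequencies are hard-coded instead of being read from input_file.txt, and nodes are owned by a vector of unique_ptr (C++14) so nothing leaks. Which of several identical subtrees gets popped first is up to the heap, so the exact bit patterns may differ from the listing above, but the code lengths should not.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    int symbol;     // >= 0 for a leaf, -(subtree depth) for an internal node
    int frequency;
    Node *left = nullptr;
    Node *right = nullptr;
    Node(int s, int f) : symbol(s), frequency(f) {}
    int depth() const { return left == nullptr ? 0 : -symbol; }
};

struct NodeCompare {
    bool operator()(const Node *n1, const Node *n2) const {
        if (n1->frequency != n2->frequency)
            return n1->frequency > n2->frequency;  // lowest frequency on top
        return n1->symbol < n2->symbol;            // ties: largest symbol on top
    }
};

// Walk the tree and record the code of every leaf.
void collect(const Node *n, const std::string &code,
             std::map<char, std::string> &out) {
    if (n == nullptr) return;
    if (n->left == nullptr) {
        out[static_cast<char>(n->symbol)] = code;
        return;
    }
    collect(n->left, code + "0", out);
    collect(n->right, code + "1", out);
}

std::map<char, std::string>
minDepthCodes(const std::vector<std::pair<char, int>> &freqs) {
    std::vector<std::unique_ptr<Node>> pool;  // owns every node
    std::priority_queue<Node *, std::vector<Node *>, NodeCompare> pq;
    for (const auto &f : freqs) {
        // Leaf symbols are stored as 0..255 so they never look like depths.
        pool.push_back(std::make_unique<Node>(
            static_cast<unsigned char>(f.first), f.second));
        pq.push(pool.back().get());
    }
    while (pq.size() > 1) {
        Node *left = pq.top();
        pq.pop();
        Node *right = pq.top();
        pq.pop();
        pool.push_back(std::make_unique<Node>(
            -(std::max(left->depth(), right->depth()) + 1),
            left->frequency + right->frequency));
        Node *parent = pool.back().get();
        parent->left = left;
        parent->right = right;
        pq.push(parent);
    }
    std::map<char, std::string> codes;
    if (!pq.empty()) collect(pq.top(), "", codes);
    return codes;
}

// Length of the longest code, i.e. the depth of the tree.
std::size_t longestCode(const std::map<char, std::string> &codes) {
    std::size_t longest = 0;
    for (const auto &c : codes) longest = std::max(longest, c.second.size());
    return longest;
}

// The thirteen symbols and frequencies from the question.
std::vector<std::pair<char, int>> questionFrequencies() {
    return {{'0', 2}, {'2', 2}, {'C', 2}, {'S', 2}, {' ', 3}, {'R', 1},
            {'6', 1}, {'G', 1}, {'3', 3}, {'I', 1}, {'N', 1}, {'O', 1},
            {'P', 1}};
}
```

Calling longestCode(minDepthCodes(questionFrequencies())) gives 4 for this input, matching the depth-four tree described above.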
Upvotes: 1
Reputation: 117812
I'd normally suggest using std::tie for creating comparators, but in this case, I'll settle for two minor changes.
1. Treat the parent nodes, the ones with $ symbols, separately. A non-parent node should come after a parent node in the queue given that they have the same frequency.
2. Swap the ordering of the nodeIDs.
struct NodeCompare {
bool operator()(const Node *n1, const Node *n2) const {
if (n1->frequency == n2->frequency) {
if (n1->symbol == n2->symbol) {
return n1->nodeID < n2->nodeID;
// ^^^ swapped
}
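if (n1->symbol != '$' && n2->symbol == '$') return true;
// if `n1` is _not_ a parent node but `n2` is, then return true
if (n1->symbol == '$' && n2->symbol != '$') return false;
// the mirrored case must return false, otherwise symbols below '$' (such as ' ') compare inconsistently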
return n1->symbol > n2->symbol;
} else
return n1->frequency > n2->frequency;
}
};
An easier way would be to use a different "parent" character and instead use one that would be sorted in the correct order naturally. That would be the character with the smallest value.
#include <limits>
constexpr char Parent = std::numeric_limits<char>::min(); // -128 probably
Then use Parent everywhere where you now use '$'.
With that, you don't need to treat the parent nodes separately and can use the much easier to implement comparator using std::tie:
#include <tuple>
struct NodeCompare {
bool operator()(const Node *n1, const Node *n2) const {
return std::tie(n1->frequency, n1->symbol, n2->nodeID) >
std::tie(n2->frequency, n2->symbol, n1->nodeID);
// note that the ordering of `nodeID` still needs to be swapped
}
};
If you don't like the idea of having a different ordering rule for nodeID, but rather have ...
struct NodeCompare {
bool operator()(const Node *n1, const Node *n2) const {
return std::tie(n1->frequency, n1->symbol, n1->nodeID) >
std::tie(n2->frequency, n2->symbol, n2->nodeID);
}
};
... you only need to use a decreasing counter instead of an increasing one:
parent->nodeID = --nodeCounter; // was parent->nodeID = ++nodeCounter;
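To see what that last variant does to the queue order, here is a small standalone sketch. The trimmed-down Node (no child pointers), the popOrder function and the "parent#" labels are made up for the demonstration, and leaf nodes are assumed to have their nodeID initialised to 0.

```cpp
#include <limits>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

constexpr char Parent = std::numeric_limits<char>::min();

// Reduced node for the demo: no child pointers, leaf nodeID assumed to be 0.
struct Node {
    char symbol;
    int frequency;
    int nodeID;
};

struct NodeCompare {
    bool operator()(const Node *n1, const Node *n2) const {
        return std::tie(n1->frequency, n1->symbol, n1->nodeID) >
               std::tie(n2->frequency, n2->symbol, n2->nodeID);
    }
};

// Pushes a few hand-made nodes and reports the order they are popped in.
std::vector<std::string> popOrder() {
    int nodeCounter = 0;
    std::vector<Node> nodes;
    nodes.push_back({'b', 2, 0});
    nodes.push_back({Parent, 2, --nodeCounter});  // older parent, nodeID -1
    nodes.push_back({'z', 1, 0});
    nodes.push_back({Parent, 2, --nodeCounter});  // newer parent, nodeID -2
    nodes.push_back({'a', 2, 0});

    std::priority_queue<const Node *, std::vector<const Node *>, NodeCompare> pq;
    for (const Node &n : nodes) pq.push(&n);

    std::vector<std::string> order;
    while (!pq.empty()) {
        const Node *n = pq.top();
        pq.pop();
        order.push_back(n->symbol == Parent
                            ? "parent#" + std::to_string(n->nodeID)
                            : std::string(1, n->symbol));
    }
    return order;
}
```

The pop order is z, parent#-2, parent#-1, a, b: the lowest frequency first, then parents before leaves of the same frequency, with the most recently created parent ahead of the older one.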
Upvotes: 1