audace

Reputation: 11

Issue with implementing the compare function in a priority queue

I'm building a Huffman compressor/decompressor and I'm having an issue with how the symbols, frequencies, and their corresponding codes are printed. I wrote a NodeCompare function object for the priority queue that compares frequencies first, then symbols, and as a last resort the nodeIDs (giving priority to the newly created node). However, my output is not what I expect, and I can't figure out where the issue in my code is. Please help.

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <queue>
#include <string>
using namespace std;

struct Node
{
    char symbol;
    int frequency;
    Node *left;
    Node *right;
    int nodeID;

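    // nodeID starts at 0 so it is never read uninitialized; parent nodes overwrite it later.
    Node(char s, int f) : symbol(s), frequency(f), left(nullptr), right(nullptr), nodeID(0) {}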
};

struct NodeCompare
{
    bool operator()(Node *n1, Node *n2)
    {
        if (n1->frequency == n2->frequency) {
            if (n1->symbol == n2->symbol) {
                return n1->nodeID > n2->nodeID;
            }
            return n1->symbol > n2->symbol;
        }
        else return n1->frequency > n2->frequency;
    }
};

void buildHuffmanTree(priority_queue<Node *, vector<Node *>, NodeCompare> &pq)
{
    static int nodeCounter = 0;

    while (pq.size() > 1)
    {
        Node *left = pq.top();
        pq.pop();
        Node *right = pq.top();
        pq.pop();
        Node *parent = new Node('$', left->frequency + right->frequency);
        parent->nodeID = ++nodeCounter;
        parent->left = left;
        parent->right = right;
        pq.push(parent);
    }
}

void printCodes(Node *root, string code)
{
    if (root == nullptr)
    {
        return;
    }
    if (root->symbol != '$')
    {
        cout << "Symbol: " << root->symbol << ", Frequency: " << root->frequency << ", Code: " << code << endl;
    }
    printCodes(root->left, code + "0");
    printCodes(root->right, code + "1");
}

int main() {
    string input_file = "input_file.txt";
    //cin >> input_file;

    ifstream infile(input_file);

    vector<Node *> symbols;
    char symbol;
    int frequency;
    priority_queue<Node *, vector<Node *>, NodeCompare> pq;

    string inputline;

    while (getline(infile, inputline)){
        istringstream ss(inputline);
        ss >> std::noskipws >> symbol;
        ss >> std::skipws >> frequency;

        Node *node = new Node(symbol, frequency);
        symbols.push_back(node);
        pq.push(node);
    }

    buildHuffmanTree(pq);
    Node *root = pq.top();
    printCodes(root, "");
}

Actual output:

Symbol: 0, Frequency: 2, Code: 000
Symbol: 2, Frequency: 2, Code: 001
Symbol: C, Frequency: 2, Code: 010
Symbol: S, Frequency: 2, Code: 011
Symbol:  , Frequency: 3, Code: 100
Symbol: R, Frequency: 1, Code: 1010
Symbol: 6, Frequency: 1, Code: 10110
Symbol: G, Frequency: 1, Code: 10111
Symbol: 3, Frequency: 3, Code: 110
Symbol: I, Frequency: 1, Code: 11100
Symbol: N, Frequency: 1, Code: 11101
Symbol: O, Frequency: 1, Code: 11110
Symbol: P, Frequency: 1, Code: 11111

Expected output:

Symbol: 0, Frequency: 2, Code: 000
Symbol: 2, Frequency: 2, Code: 001
Symbol: I, Frequency: 1, Code: 0100
Symbol: N, Frequency: 1, Code: 0101
Symbol: 6, Frequency: 1, Code: 0110
Symbol: G, Frequency: 1, Code: 0111
Symbol: R, Frequency: 1, Code: 1000
Symbol: O, Frequency: 1, Code: 10010
Symbol: P, Frequency: 1, Code: 10011
Symbol:  , Frequency: 3, Code: 101
Symbol: 3, Frequency: 3, Code: 110
Symbol: C, Frequency: 2, Code: 1110
Symbol: S, Frequency: 2, Code: 1111

Upvotes: 1

Views: 157

Answers (2)

Mark Adler

Reputation: 112502

There is nothing wrong with your actual output. Both the actual output and expected output are optimal Huffman codes. Both code the shown frequencies in a total of 76 bits, which is the sum over the symbols of the frequency times the code length. Because this set of frequencies results in ties when deciding on which of the lowest frequencies to combine next (in fact, many ties), there are different, equally valid choices at those steps that result in different trees and different codes. No matter what choices are made, an optimal code is generated that codes the input in a total of 76 bits, every time.
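The 76-bit figure can be checked directly from the two listings in the question. The following is a minimal sketch; the names CodeTable, totalBits, actualTable, and expectedTable are made up for illustration, and the rows are copied from the question's actual and expected output.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One row per symbol: (frequency, code), copied from the question's listings.
using CodeTable = std::vector<std::pair<int, std::string>>;

// Total encoded size in bits: the sum of frequency * code length.
int totalBits(const CodeTable& table) {
    int bits = 0;
    for (const auto& row : table) {
        assert(row.first >= 0);  // frequencies are counts
        bits += row.first * static_cast<int>(row.second.size());
    }
    return bits;
}

// Rows from the question's "actual output".
CodeTable actualTable() {
    return {{2, "000"}, {2, "001"}, {2, "010"}, {2, "011"}, {3, "100"},
            {1, "1010"}, {1, "10110"}, {1, "10111"}, {3, "110"},
            {1, "11100"}, {1, "11101"}, {1, "11110"}, {1, "11111"}};
}

// Rows from the question's "expected output".
CodeTable expectedTable() {
    return {{2, "000"}, {2, "001"}, {1, "0100"}, {1, "0101"}, {1, "0110"},
            {1, "0111"}, {1, "1000"}, {1, "10010"}, {1, "10011"}, {3, "101"},
            {3, "110"}, {2, "1110"}, {2, "1111"}};
}
```

Calling totalBits(actualTable()) and totalBits(expectedTable()) both give 76.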

For that particular set of frequencies, seven 1's, four 2's, and two 3's, there are so many ties that there are 16 topologically distinct trees that can arise, depending on how the choices are made. Here are all of those trees:

(Image: 16 trees)

What's more, for each of those trees, each branch can independently be made 0 on the left and 1 on the right, or 1 on the left and 0 on the right. Every topology has 12 branches (one less than the number of symbols), and so there are 4096 possible assignments of 0's and 1's to the codes for each topology.

In total then, I can generate 65,536 different Huffman codes for that set of frequencies, all of which are valid and all of which are equally optimal. You are showing two of those. There is nothing special about either of them.

There is something special about the last two trees shown above, in that they have a depth (longest code length) of four, whereas all of the other trees have depth five. Some might prefer a minimum-depth tree, which neither of the codes you showed is. You can resolve ties to get a minimum-depth tree by combining the minimum-depth leaves and/or subtrees of the same frequency. (Leaves have depth zero.)

To do that, you would keep track of the depth and sort on it when the frequencies are equal. I would make symbol an int instead of a char, and store the depths of subtrees as negative numbers in symbol, i.e. -1 for depth one, -2 for depth two, and so on. (Make sure your symbols are always non-negative, so bytes must be 0..255, not -128..127.) Then when the frequencies are equal, prefer the maximum values of symbol.
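One way to keep byte symbols in 0..255 is to convert through unsigned char before storing them in the int. This is only a sketch, and toSymbol is a made-up helper name, not something from the question's code.

```cpp
// Hypothetical helper: map a raw byte to a non-negative int symbol (0..255),
// which leaves the negative values free to encode subtree depths.
int toSymbol(char byte) {
    return static_cast<unsigned char>(byte);
}
```

With this, a byte such as 0xE9 becomes 233 whether plain char is signed or unsigned on the platform.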

In struct Node, have:

    int symbol;

and add to struct Node:

    int depth() const {
        return left == nullptr ? 0 : -symbol;
    }

The new NodeCompare is:

struct NodeCompare
{
    bool operator()(Node *n1, Node *n2)
    {
        if (n1->frequency != n2->frequency)
            return n1->frequency > n2->frequency;
        else
            return n1->symbol < n2->symbol;
    }
};

The depth is updated in buildHuffmanTree() using:

        // max needs #include <algorithm>; Node's constructor must now take an int symbol.
        Node *parent = new Node(-(max(left->depth(), right->depth()) + 1),
                                left->frequency + right->frequency);

For showing the tree, you didn't need the $ business anyway. Just check one of the pointers:

    if (root->left == nullptr)
    {
        cout << "Symbol: " << (char)(root->symbol) << ", Frequency: " << root->frequency << ", Code: " << code << endl;
    }

Now it will construct a minimum-depth but still optimal tree, which in this case has depth four:

Symbol: 0, Frequency: 2, Code: 000
Symbol: O, Frequency: 1, Code: 0010
Symbol: N, Frequency: 1, Code: 0011
Symbol: R, Frequency: 1, Code: 0100
Symbol: P, Frequency: 1, Code: 0101
Symbol: I, Frequency: 1, Code: 0110
Symbol: G, Frequency: 1, Code: 0111
Symbol: 3, Frequency: 3, Code: 100
Symbol:  , Frequency: 3, Code: 101
Symbol: 6, Frequency: 1, Code: 1100
Symbol: S, Frequency: 2, Code: 1101
Symbol: C, Frequency: 2, Code: 1110
Symbol: 2, Frequency: 2, Code: 1111
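The same tie-breaking rule can also be sketched without packing the depth into symbol. The version below is an illustration under stated assumptions, not the code above: TreeNode and minDepthCodeLengths are hypothetical names, nodes live in a vector and refer to each other by index, and the depth is a separate tuple field. It returns only code lengths, which is enough to check the depth and the total size.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// Child indices into the node array; -1 marks a leaf.
struct TreeNode {
    int left;
    int right;
};

// Returns the code length of each input symbol, breaking frequency ties
// in favor of the shallowest subtree.
std::vector<int> minDepthCodeLengths(const std::vector<int>& freqs) {
    // (frequency, subtree depth, node index); std::greater makes a min-heap,
    // so the lowest frequency pops first, then the smallest depth.
    using Entry = std::tuple<int, int, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;

    std::vector<TreeNode> nodes(freqs.size(), TreeNode{-1, -1});
    for (int i = 0; i < static_cast<int>(freqs.size()); ++i)
        pq.emplace(freqs[i], 0, i);  // leaves have depth zero

    while (pq.size() > 1) {
        Entry a = pq.top(); pq.pop();
        Entry b = pq.top(); pq.pop();
        nodes.push_back(TreeNode{std::get<2>(a), std::get<2>(b)});
        pq.emplace(std::get<0>(a) + std::get<0>(b),
                   std::max(std::get<1>(a), std::get<1>(b)) + 1,
                   static_cast<int>(nodes.size()) - 1);
    }

    std::vector<int> lengths(freqs.size(), 0);
    if (pq.empty()) return lengths;

    // Walk down from the root; a leaf's distance from the root is its code length.
    std::vector<std::pair<int, int>> stack;
    stack.push_back(std::make_pair(std::get<2>(pq.top()), 0));
    while (!stack.empty()) {
        std::pair<int, int> cur = stack.back();
        stack.pop_back();
        const TreeNode& n = nodes[cur.first];
        if (n.left < 0) { lengths[cur.first] = cur.second; continue; }
        stack.push_back(std::make_pair(n.left, cur.second + 1));
        stack.push_back(std::make_pair(n.right, cur.second + 1));
    }
    return lengths;
}
```

For the question's frequencies (seven 1's, four 2's, two 3's), the longest returned length is 4 and the frequency-weighted sum of lengths is 76. The node index is used as a last tie-breaker only to make the pop order deterministic.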

Upvotes: 1

Ted Lyngmo

Reputation: 117812

I'd normally suggest using std::tie for creating comparators, but in this case, I'll settle for two minor changes.

  • You need to treat the parent nodes, the $ symbols, separately. A non-parent node should come after a parent node in the queue given that they have the same frequency.
  • You need to swap the priority order of the nodeIDs.

struct NodeCompare {
    bool operator()(const Node *n1, const Node *n2) const {
        if (n1->frequency == n2->frequency) {
            if (n1->symbol == n2->symbol) {
                return n1->nodeID < n2->nodeID;
//                               ^^^ swapped
            }
            if (n1->symbol != '$' && n2->symbol == '$') return true;
// if `n1` is _not_ a parent node but `n2` is, then return true
            return n1->symbol > n2->symbol;
        } else
            return n1->frequency > n2->frequency;
    }
};

An easier way would be to replace '$' with a different "parent" character, one that naturally sorts in the correct order. That would be the character with the smallest value.

#include <limits>

constexpr char Parent = std::numeric_limits<char>::min(); // -128 probably

Then use Parent everywhere where you now use '$'.

With that, you don't need to treat the parent nodes separately and can use the much easier to implement comparator using std::tie:

#include <tuple>

struct NodeCompare {
    bool operator()(const Node *n1, const Node *n2) const {
        return std::tie(n1->frequency, n1->symbol, n2->nodeID) >
               std::tie(n2->frequency, n2->symbol, n1->nodeID);
               // note that the ordering of `nodeID` still needs to be swapped
    }
};

If you don't like the idea of having a different ordering rule for nodeID, and would rather have ...

struct NodeCompare {
    bool operator()(const Node *n1, const Node *n2) const {
        return std::tie(n1->frequency, n1->symbol, n1->nodeID) >
               std::tie(n2->frequency, n2->symbol, n2->nodeID);
    }
};

... you only need to use a decreasing counter instead of an increasing one:

parent->nodeID = --nodeCounter; // was parent->nodeID = ++nodeCounter;
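For reference, here is a self-contained sketch that combines the Parent sentinel, the plain std::tie comparator, and the decreasing counter. It makes a few assumptions that are not in the question's code: nodes are owned by a vector of std::unique_ptr instead of raw new, leaves get nodeID 0, symbols are distinct, and the helper names collectCodes and huffmanCodes are made up for illustration.

```cpp
#include <limits>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Sentinel for internal nodes: the smallest char sorts before every real symbol.
constexpr char Parent = std::numeric_limits<char>::min();

struct Node {
    char symbol;
    int frequency;
    int nodeID;
    const Node* left;
    const Node* right;
};

// Orders by frequency, then symbol, then nodeID; smallest comes out of the queue first.
struct NodeCompare {
    bool operator()(const Node* n1, const Node* n2) const {
        return std::tie(n1->frequency, n1->symbol, n1->nodeID) >
               std::tie(n2->frequency, n2->symbol, n2->nodeID);
    }
};

// Walk the tree, appending '0' for a left branch and '1' for a right branch.
void collectCodes(const Node* n, const std::string& code,
                  std::map<char, std::string>& out) {
    if (n == nullptr) return;
    if (n->left == nullptr) { out[n->symbol] = code; return; }
    collectCodes(n->left, code + "0", out);
    collectCodes(n->right, code + "1", out);
}

std::map<char, std::string> huffmanCodes(
        const std::vector<std::pair<char, int>>& freqs) {
    std::vector<std::unique_ptr<Node>> arena;  // owns every node
    std::priority_queue<const Node*, std::vector<const Node*>, NodeCompare> pq;
    for (const auto& sf : freqs) {
        arena.push_back(std::unique_ptr<Node>(
            new Node{sf.first, sf.second, 0, nullptr, nullptr}));
        pq.push(arena.back().get());
    }
    int nodeCounter = 0;
    while (pq.size() > 1) {
        const Node* left = pq.top(); pq.pop();
        const Node* right = pq.top(); pq.pop();
        // Decreasing counter: newer parents get smaller IDs and pop first on ties.
        arena.push_back(std::unique_ptr<Node>(
            new Node{Parent, left->frequency + right->frequency,
                     --nodeCounter, left, right}));
        pq.push(arena.back().get());
    }
    std::map<char, std::string> codes;
    if (!pq.empty()) collectCodes(pq.top(), "", codes);
    return codes;
}
```

Fed the thirteen symbol/frequency pairs from the question, huffmanCodes returns the codes listed in the question's expected output, for example 000 for '0', 0100 for 'I', and 10010 for 'O'.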

Upvotes: 1
