johnco3
johnco3

Reputation: 2652

using stl to run length encode a string using std::adjacent_find

I am trying to perform run length compression on a string for a special protocol that I am using. Runs are considered efficient when the run size or a particular character in the string is >=3. Can someone help me to achieve this. I have live demo on coliru. I am pretty sure this is possible with the standard library's std::adjacent_find with a combination of std::not_equal_to<> as the binary predicate to search for run boundaries and probably using std::equal_to<> once I find a boundary. Here is what I have so far but I am having trouble with the results:

Given the following input text string containing runs or spaces and other characters (in this case runs of the letter 's':

"---thisssss---is-a---tesst--"

I am trying to convert the above text string into a vector containing elements that are either pure runs of > 2 characters or mixed characters. The results are almost correct but not quite and I cannot spot the error.

g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
expected the following
======================
---,thi,sssss,---,is-a,---,tesst--,
actual results
==============
---,thi,sssss,---,is-a,---,te,ss,--,

EDIT: I fixed up the previous code to make this version closer to the final solution. Specifically I added explicit tests for the run size to be > 2 to be included. I seem to be having boundary case problems though - the all spaces case and the case where the end of the strings ends in several spaces:

#include <iterator>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
#include <algorithm>
#include <functional>

int main()
{
    // I want to convert this string containing adjacent runs of characters
    std::string testString("---thisssss---is-a---tesst--");

    // to the following 
    std::vector<std::string> idealResults = {
        "---", "thi", "sssss",
        "---", "is-a",
        "---", "tesst--"
    };

    std::vector<std::string> tokenizedStrings;
    auto adjIter = testString.begin();
    auto lastIter = adjIter;
    // temporary string used to accumulate characters that 
    // are not part of a run.
    std::unique_ptr<std::string> stringWithoutRun;
    while ((adjIter = std::adjacent_find(
        adjIter, testString.end(), std::not_equal_to<>())) !=
        testString.end()) {
        auto next = std::string(lastIter, adjIter + 1);
        // append to foo if < run threshold
        if (next.length() < 2) {
            if (!stringWithoutRun) {
                stringWithoutRun = std::make_unique<std::string>();
            }
            *stringWithoutRun += next;
        } else {
            // if we have encountered non run characters, save them first
            if (stringWithoutRun) {
                tokenizedStrings.push_back(*stringWithoutRun);
                stringWithoutRun.reset();
            }
            tokenizedStrings.push_back(next);
        }
        lastIter = adjIter + 1;
        adjIter = adjIter + 1;
    }
    tokenizedStrings.push_back(std::string(lastIter, adjIter));

    std::cout << "expected the following" << std::endl;
    std::cout << "======================" << std::endl;
    std::copy(idealResults.begin(), idealResults.end(), std::ostream_iterator<std::string>(std::cout, ","));
    std::cout << std::endl;

    std::cout << "actual results" << std::endl;
    std::cout << "==============" << std::endl;
    std::copy(tokenizedStrings.begin(), tokenizedStrings.end(), std::ostream_iterator<std::string>(std::cout, ","));
    std::cout << std::endl;
}

Upvotes: 0

Views: 187

Answers (1)

Anton
Anton

Reputation: 3203

if (next.length() < 2) {
    if (!stringWithoutRun) {
        stringWithoutRun = std::make_unique<std::string>();
    }
    *stringWithoutRun += next;
}

This should be if (next.length() <= 2). You need to add a run of identical characters to the current token if its length is either 1 or 2.

I seem to be having boundary case problems though - the all spaces case and the case where the end of the strings ends in several spaces

When stringWithoutRun is not empty after the loop finishes, the characters accumulated in it are not added to the array of tokens. You can fix it like this:

// The loop has finished
if (stringWithoutRun)
    tokenizedStrings.push_back(*stringWithoutRun);
tokenizedStrings.push_back(std::string(lastIter, adjIter));

Upvotes: 1

Related Questions