Dillon Davis

Reputation: 7740

Optimal algorithm for encoding data within pre-existing data

Say we have some existing random data, uniformly distributed, that's been written to some medium. Let's also say that writing a 1 bit is a destructive action, while writing a 0 bit is non-destructive. This could be analogous to punching holes in a punch card or burning a fuse in a circuit to hard-code a 1. After this initial write has already taken place, let's say we now wish to write additional data to this medium, with some special encoding that will enable us to retrieve this new data in the future, with no special knowledge of the pre-existing data that was already destructively written to the medium. The pre-existing data itself does not need to remain retrievable; only the new data does.

Also assume that the new data is itself random, or at least has already been compressed to be effectively random.

Obviously we cannot expect to exceed ~50% of the capacity of the original storage, since only roughly half of it will be writable. The trouble is pushing the efficiency as close to this limit as possible.
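
One way to see this bound more concretely (a rough counting sketch, assuming the decoder gets no side information about the original tape): for a fixed original tape, the only freedom the encoder has is which of the surviving gaps to punch, so

    number of distinguishable messages <= 2^(number of gaps) ≈ 2^(n/2)

i.e. at most about n/2 bits of new data can be conveyed on an n-cell tape, no matter the encoding.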

I have two fairly simple encodings with seemingly reasonable efficiencies; however, they do not appear to be optimal. For ease of explanation, I will assume the medium is a paper tape with holes (denoting a 1) or gaps (denoting a 0) at regular intervals.

Encoding A

Encode 1 bits with a gap at an even offset along the tape, and 0 bits with a gap at an odd offset.

Offset:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Tape:      G  G  H  H  H  H  H  G  H  H  H  G  H  H  H  H  G  H  H  H
           |  |                 |           |              |
New Data:  1  0                 0           0              1
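
For concreteness, here is a rough Python sketch of Encoding A (just an illustration; encode_a and decode_a are made-up names, and end-of-data framing is handled by punching out every leftover gap):

import random

def encode_a(tape, bits):
    # tape is a list of 0 (gap) / 1 (hole); only 0 -> 1 changes are ever made
    out = tape[:]
    pos = 0
    for b in bits:
        want_parity = 0 if b == 1 else 1   # 1 bits live at even offsets, 0 bits at odd
        while True:
            if pos >= len(out):
                raise ValueError("ran out of tape")
            if out[pos] == 0:              # a surviving gap
                if pos % 2 == want_parity:
                    pos += 1               # leave it; it encodes b
                    break
                out[pos] = 1               # wrong parity: destroy it so the decoder skips it
            pos += 1
    for i in range(pos, len(out)):         # punch every leftover gap so it isn't read as data
        out[i] = 1
    return out

def decode_a(tape):
    # every surviving gap is one data bit: even offset -> 1, odd offset -> 0
    return [1 - (i % 2) for i, c in enumerate(tape) if c == 0]

# quick self-check on a random pre-written tape
tape = [random.randint(0, 1) for _ in range(1000)]
data = [random.randint(0, 1) for _ in range(30)]
punched = encode_a(tape, data)
assert all(new >= old for old, new in zip(tape, punched))  # never un-punches a hole
assert decode_a(punched) == data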

Encoding B

Encode 1 bits with a gap followed by another gap, and 0 bits with a gap followed by a hole.

Offset: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Tape:   H  G  G  H  H  H  H  G  H  H  H  G  G  H  H  H  G  H  G  H
            \/                \/          \/             \/    \/
New Data:   1                 0           1              0     0
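
A similar sketch for Encoding B (again just an illustration with made-up names):

import random

def encode_b(tape, bits):
    # each data bit becomes a two-cell symbol anchored at a gap: gap+gap = 1, gap+hole = 0
    out = tape[:]
    pos = 0
    for b in bits:
        while True:
            if pos + 1 >= len(out):
                raise ValueError("ran out of tape")
            if out[pos] != 0:                  # hole: not usable as an anchor
                pos += 1
            elif b == 0:
                out[pos + 1] = 1               # force the pair to gap+hole
                pos += 2
                break
            elif out[pos + 1] == 0:            # the pair is already gap+gap
                pos += 2
                break
            else:
                out[pos] = 1                   # can't form gap+gap here; destroy the anchor
                pos += 1
    for i in range(pos, len(out)):             # punch every leftover gap
        out[i] = 1
    return out

def decode_b(tape):
    # each surviving gap starts a two-cell symbol: gap+gap -> 1, gap+hole -> 0
    bits, i = [], 0
    while i < len(tape):
        if tape[i] == 0 and i + 1 < len(tape):
            bits.append(1 if tape[i + 1] == 0 else 0)
            i += 2
        else:
            i += 1
    return bits

tape = [random.randint(0, 1) for _ in range(1000)]
data = [random.randint(0, 1) for _ in range(30)]
punched = encode_b(tape, data)
assert all(new >= old for old, new in zip(tape, punched))
assert decode_b(punched) == data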

Both of these, on average, encode data equal to 25% of the original storage capacity. This seems pretty reasonable, at least to me: only half of the tape was gaps to begin with, and any holes we write lose information (since we cannot know whether they pre-existed when we later attempt to decode), so a quarter, 25%. At first it seems like it could be optimal.
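
As a sanity check on the 25% figure for Encoding A (a rough expected-cost calculation, assuming the pre-existing tape is uniformly random): each data bit needs a gap at an offset of one particular parity; offsets of that parity occur every 2 cells, and each is independently a gap with probability 1/2, so

    expected cells per data bit ≈ 2 × 2 = 4,  i.e. efficiency ≈ 1/4 = 25%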

However, what's bothering me is that it seems like this is not in fact the case. I have a somewhat more complex encoding that consistently breaks this threshold.

Encoding C

At its core, we encode 1 bits as runs of holes of odd length, and 0 bits as runs of holes of even length. So HHG would represent 0 while HG or HHHG would represent 1.

This by itself is not enough to cross the 25% threshold, however, so we add one optimization. When we see two successive gaps after a run of holes and its terminating gap, we treat those two gaps as a 0. The logic behind this is that if we had wanted to encode a 1 there, we could have just punched out the first of the two gaps, producing HG, which encodes a 1. Since we did not do so, we can assume that we needed to encode a 0 instead.

With this optimization, encoding C reaches a stable 26.666% +/- 0.005% storage efficiency, consistently above the theoretical 25%. I also have a handful of really tricky peephole optimizations that I can tack onto this encoding, which I suspect would push its efficiency up to 28-30%, but it feels like I'm already overthinking this.

Encoding D?

This isn't really an encoding, but it's one last thought I had, and it can improve any of the above encodings. Consider an encoding E(x) and some deterministic, reversible transformation T(x) which mutates the data in some arbitrary fashion. Say we prepend a 0 bit to our data-to-be-encoded and encode it: E('0' + DATA). Say we also mutate our data, prepend a 1 bit, and then encode it: E('1' + T(DATA)). We could then compare the two, see which happened to encode more data (greater efficiency), and choose that one. The improvement would be small overall, hinging on statistical variation, and we sacrifice a bit as an indicator, but as long as the data is large enough, the average savings will outweigh the single bit lost. This could be generalized: partition the data into a few pieces and choose between 2 or more encodings, whichever happens to randomly fit best, but that's beside the point. Overall, the improvement would be small, but it should still be a strict improvement, indicating that the base encoding E(x) is not optimal.
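
A small sketch of this idea (purely illustrative; it reuses Encoding A's scan from the sketch above as the cost model, and cells_used_a, transform, and choose_variant are made-up names):

import random

def cells_used_a(tape, bits):
    # how many cells Encoding A would consume to encode bits on this tape, or None if it runs out
    pos = 0
    for b in bits:
        want_parity = 0 if b == 1 else 1
        while True:
            if pos >= len(tape):
                return None
            hit = tape[pos] == 0 and pos % 2 == want_parity
            pos += 1
            if hit:
                break
    return pos

def transform(bits):
    # T(x): any cheap, reversible mutation; complementing every bit will do
    return [1 - b for b in bits]

def choose_variant(tape, data):
    # try E([0] + data) and E([1] + T(data)); keep whichever happens to fit in fewer cells
    candidates = [[0] + data, [1] + transform(data)]
    costs = [(cells_used_a(tape, bits), bits) for bits in candidates]
    costs = [(c, bits) for c, bits in costs if c is not None]
    return min(costs, key=lambda pair: pair[0])

tape = [random.randint(0, 1) for _ in range(1000)]
data = [random.randint(0, 1) for _ in range(40)]
cost, bits = choose_variant(tape, data)
print(cost, bits[0])   # cells consumed, and the indicator bit the decoder will recover first

On the decode side, the leading recovered bit says whether to undo T; the complement used here is its own inverse.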


To recap: I'm looking for the most efficient (in the average case) lossless encoding for writing data to a medium that has already been semi-destructively (1s only) written with purely random data. I really believe the optimal solution is somewhere between 30-50% efficiency, but I'm struggling to find it. I hope someone can either share what the optimal encoding is, or shed some light on the relevant literature and theory around this topic.

Side note: In the process of trying to get a better understanding of this problem, I tried to create a less efficient algorithm that bounds the worst-case encoding efficiency anywhere above 0%, but failed. It seems that no matter which encoding I tried, even if half of the storage were guaranteed to be writable, an astronomically unlikely ordering of the pre-existing data can ensure that we're unable to encode even a single bit of the new data. This isn't really a concern for the actual problem statement, since I care about average-case efficiency, but it was unsettling.

Upvotes: 4

Views: 206

Answers (2)

David Eisenstat

Reputation: 65458

I implemented the full algorithm in Python. I augment the matrix with an identity matrix, as in a matrix inversion algorithm, but zero out the columns that correspond to holes.

import random

log_n = 10
n = log_n + ((1 << log_n) - 1)  # block size: log_n bits of length plus up to 2**log_n - 1 bits of payload
matrix = [random.randrange(1 << n) for j in range(n)]  # random n-by-n matrix over GF(2), one row per tape position


def decode(data):
    # the decoded text is simply the XOR of the matrix rows at every gap (0 bit) of the tape
    text = 0
    for j in range(n):
        if (data & (1 << j)) == 0:
            text ^= matrix[j]
    return text


def invert(existing_data):
    # Gauss-Jordan elimination over GF(2) on the rows whose tape position is still
    # a gap; inverse[i] records which gaps to leave unpunched in order to flip bit i
    # of the decoded text, truncated at the first column with no pivot
    sub = [matrix[j] for j in range(n) if (existing_data & (1 << j)) == 0]
    inverse = [1 << j for j in range(n) if (existing_data & (1 << j)) == 0]
    i = 0
    while True:
        for k in range(i, len(sub)):
            if sub[k] & (1 << i):
                break
        else:
            return inverse[:i]
        sub[i], sub[k] = sub[k], sub[i]
        inverse[i], inverse[k] = inverse[k], inverse[i]
        for k in range(len(sub)):
            if k != i and (sub[k] & (1 << i)):
                sub[k] ^= sub[i]
                inverse[k] ^= inverse[i]
        i += 1


def encode(inverse, text):
    # start from the all-holes tape and leave gaps (clear bits) so the decoder recovers text
    data = ~((~0) << n)  # n one-bits
    for i in range(len(inverse)):
        if text & (1 << i):
            data ^= inverse[i]
    return data


def test():
    existing_data = random.randrange(1 << n)     # random pre-punched tape (1 = hole)
    inverse = invert(existing_data)
    payload_size = max(len(inverse) - log_n, 0)  # reserve log_n controllable bits for the length field
    payload = random.randrange(1 << payload_size)
    text = payload_size ^ (payload << log_n)     # length in the low bits, payload above it
    data = encode(inverse, text)
    assert (existing_data & ~data) == 0          # no existing hole was turned back into a gap

    decoded_text = decode(data)
    decoded_payload_size = decoded_text & (~((~0) << log_n))
    decoded_payload = (decoded_text >> log_n) & (~((~0) << payload_size))
    assert payload_size == decoded_payload_size
    assert payload == decoded_payload
    return payload_size / n


print(sum(test() for i in range(100)))  # sum of 100 per-block fractions ~= average efficiency in percent

Upvotes: 2

David Eisenstat

Reputation: 65458

I suspect that the expected capacity approaches 50% in the limit as the number of bits n → ∞.

The encoding algorithm that I have in mind uses linear algebra over the finite field F2. Ahead of time, choose ε > 0 and a random matrix A of dimension (1 − ε) n/2 × n. To decode a vector x, if x is not all ones, then return the matrix-vector product A (1 − x); otherwise, fail. To encode a vector b, use Gaussian elimination to solve for nonzero x′ in A′ (1 − x′) = b, where A′ omits the columns corresponding to preexisting one bits and x′ omits the rows corresponding to preexisting one bits. If there is no solution, punch a lace card.

I don’t have time to write and verify a formal proof, but my intuition is that, in taking sub-matrices of A, the probability that we encounter linear dependence decreases very fast as ε edges away from zero.

I implemented a simulation of a more practical version of this algorithm in C++ below (which could be extended to an encoder/decoder pair without too much trouble). Instead of being all or nothing, it determines the longest prefix that it can control and uses the block as a 6-bit length followed by that amount of data, realizing ≈38% of the original storage. The length prefix is eating about 9% here (we had control of ≈47% of the bits), but it approaches 0% in the large-n limit. Even with 64-bit blocks, you could do better by Huffman coding the lengths.

(You might wonder what happens if we can’t even control all of the length bits. We can set things up so that the lace card decodes to length 0, which implies that it is skipped, then punch a lace card whenever that has to happen.)

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <limits>
#include <optional>
#include <random>
#include <vector>

namespace {

static std::random_device g_dev;

class Vector {
public:
  explicit Vector() = default;
  static Vector Random() {
    return Vector{std::uniform_int_distribution<Bits>{
        0, std::numeric_limits<Bits>::max()}(g_dev)};
  }
  Vector &operator^=(const Vector v) {
    bits_ ^= v.bits_;
    return *this;
  }
  bool operator[](const std::size_t i) const { return bits_ & (Bits{1} << i); }

private:
  using Bits = unsigned long long;
  explicit Vector(const Bits bits) : bits_{bits} {}
  Bits bits_ = 0;
};

class Matrix {
public:
  static Matrix Random(const std::size_t num_rows) {
    Matrix a;
    a.rows_.reserve(num_rows);
    for (std::size_t i = 0; i < num_rows; ++i) {
      a.rows_.push_back(Vector::Random());
    }
    return a;
  }

  // keep each row independently with probability p, simulating which of the
  // block's tape positions are still writable gaps
  Matrix RandomSubsetOfRows(const double p) const {
    Matrix a;
    for (const Vector row : rows_) {
      if (std::bernoulli_distribution{p}(g_dev)) {
        a.rows_.push_back(row);
      }
    }
    return a;
  }

  // Gauss-Jordan elimination over GF(2); returns the index of the first column
  // with no pivot, i.e. how many leading bits of the decoded text we can control
  std::size_t NumControllablePrefixBits() && {
    for (std::size_t j = 0; true; ++j) {
      const auto pivot = [&]() -> std::optional<std::size_t> {
        for (std::size_t i = j; i < rows_.size(); ++i) {
          if (rows_[i][j]) {
            return i;
          }
        }
        return std::nullopt;
      }();
      if (!pivot) {
        return j;
      }
      std::swap(rows_[j], rows_[*pivot]);
      for (std::size_t i = 0; i < rows_.size(); ++i) {
        if (i != j && rows_[i][j]) {
          rows_[i] ^= rows_[j];
        }
      }
    }
  }

private:
  std::vector<Vector> rows_;
};

} // namespace

int main() {
  static constexpr std::size_t kBlocks = 10000;
  static constexpr std::size_t kLogBlockSize = 6;
  static constexpr std::size_t kBlockSize = 1 << kLogBlockSize;
  std::size_t num_usable_bits = 0;
  const Matrix a = Matrix::Random(kBlockSize);
  for (std::size_t i = 0; i < kBlocks; ++i) {
    // usable payload = controllable prefix minus the 6-bit length header
    num_usable_bits +=
        a.RandomSubsetOfRows(0.5).NumControllablePrefixBits() - kLogBlockSize;
  }
  std::cout << static_cast<double>(num_usable_bits) / (kBlocks * kBlockSize)
            << "\n";
}

Upvotes: 2
