Dycey

Reputation: 4705

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170-odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input and output has its own distinct set of values, e.g. [ red, green, yellow ] or [ yes, no ] etc.

I'm putting together the entire state table using the existing code, and instead of puzzling out each output function by hand, I was wondering whether there is an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note that not all input columns may be required to determine an individual output column, e.g. r124 might depend only on i5, i6 and i9.

These are not continuous functions, and I expect I might end up with some sort of hashing-function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there were some Karnaugh-map-like approach for multi-valued, non-binary functions ;-) )

Upvotes: 3

Views: 259

Answers (2)

Eugene D. Gubenkov

Reputation: 5357

The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then, for each output, we can eliminate the inputs that do not matter for it. As a result, for each output we get a matrix showing the output values for all combinations of the inputs that do matter.

Sample input format (for the code snippet below):

var schema = new ConvertionSchema()
{
    InputPossibleValues = new object[][]
    {
        new object[] { 1, 2, 3, }, // input #0
        new object[] { 'a', 'b', 'c' }, // input #1
        new object[] { "foo", "bar" }, // input #2
    },
    Converters = new System.Func<object[], object>[]
    {
        input => input[0], // output #0
        input => (int)input[0] + (int)(char)input[1], // output #1
        input => (string)input[2] == "foo" ? 1 : 42, // output #2
        input => input[2].ToString() + input[1].ToString(), // output #3
        input => (int)input[0] % 2, // output #4
    }
};

Sample output:

(screenshot of the sample output tables omitted)

The heart of the backward conversion is below. The full code, in the form of a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.

public void Reverse(ConvertionSchema schema)
{
    // generate all possible input vectors and record the result for each case
    // then for each output we can figure out which inputs matter

    object[][] inputs = schema.GenerateInputVectors();

    // reversal path
    for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
    {
        List<int> inputsThatDoNotMatter = new List<int>();

        for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
        {
            // group the input vectors by all other inputs (excluding the current one);
            // if within every group the outputs are exactly the same, then the current input
            // does not matter for the given output

            bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
                .Where(x => x.Distinct().Count() > 1)
                .Any();

            if (!inputMatters)
            {
                inputsThatDoNotMatter.Add(inputIdx);
                Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
            }
        }

        // mapping table (only the inputs that matter)
        var mapping = new List<dynamic>();

        foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
        {
            dynamic record = new ExpandoObject();

            object[] sampleInput = inputGroup.First();

            object output = schema.Convert(sampleInput)[outputIdx];

            for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
            {
                if (inputsThatDoNotMatter.Contains(inputIdx))
                    continue;

                AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);

            }

            AddProperty(record, $"Output #{outputIdx}", output);

            mapping.Add(record);
        }

        // input x, ..., input y, output z form is needed
        mapping.Dump();
    }
}

Upvotes: 0

btilly

Reputation: 46435

If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.

First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of them that produce i as an output. Let p[i] = float(x[i])/float(n), and then the entropy is - sum(p[i] * log(p[i]) for i in outputs). (Note: since p[i] < 1, log(p[i]) is a negative number, and therefore the entropy is positive. Also note: if p[i] = 0, then we take p[i] * log(p[i]) to be zero.)
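For concreteness, here is a minimal sketch of that calculation in Python (matching the pseudocode above); outputs is assumed to be the list of observed output values, one per input vector:

from collections import Counter
from math import log

def entropy(outputs):
    # entropy of one output column; outputs holds the observed value for every input vector
    n = len(outputs)
    counts = Counter(outputs)       # counts[i] is x[i]: how many input vectors yield value i
    total = 0.0
    for x in counts.values():
        p = x / n                   # p[i] = x[i] / n; values with x[i] = 0 never appear in counts,
        total -= p * log(p)         # so the 0 * log(0) case never arises
    return total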

The amount of entropy can be thought of as the amount of information needed to predict the outcome.

Now here is the key question. What variable gives us the most information about the output per information about the input?

If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
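As a sketch of that scoring (building on the entropy() helper above): rows is assumed to be a list of (input_vector, output_value) pairs for one output column, and the per-value entropies are averaged weighted by how often each value of v occurs, which coincides with the plain average when every value of v appears equally often, as it does in a full enumeration:

from math import log

def information_gain_ratio(rows, v):
    # rows: list of (input_vector, output_value) pairs for one output column
    # returns information_gained_from_v / information_to_specify_v
    n = len(rows)
    total_entropy = entropy([out for _, out in rows])

    # group the rows by the value of input v
    groups = {}
    for inp, out in rows:
        groups.setdefault(inp[v], []).append(out)

    # entropy left after v is known, averaged over v's values (weighted by group size)
    remaining = sum(len(outs) / n * entropy(outs) for outs in groups.values())
    gained = total_entropy - remaining

    cost = log(len(groups))         # information needed to specify v's value
    return gained / cost if cost > 0 else 0.0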

Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.

Then for each value, you repeat this process to get cascading nested if conditions.

This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
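Put together, this is essentially greedy decision-tree induction. A hypothetical sketch, reusing the two helpers above, that prints the nested conditions for one output column:

def build_tree(rows, variables, indent=""):
    # rows: (input_vector, output_value) pairs still consistent with the branch taken so far
    # variables: indexes of the inputs not yet switched on
    outputs = {out for _, out in rows}
    if len(outputs) == 1 or not variables:
        print(indent + "-> output is " + ", ".join(repr(o) for o in sorted(outputs, key=repr)))
        return

    # switch on the variable that gives the most information per bit spent specifying it
    best = max(variables, key=lambda v: information_gain_ratio(rows, v))

    for value in sorted({inp[best] for inp, _ in rows}, key=repr):
        subset = [(inp, out) for inp, out in rows if inp[best] == value]
        print(indent + "if input[%d] == %r:" % (best, value))
        build_tree(subset, [v for v in variables if v != best], indent + "    ")

The recursion stops when all remaining rows agree on the output, or when the inputs are exhausted (in which case every output still possible for that branch is listed).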


Now this assumed that you had a comprehensive enumeration. But what if you don't?

The answer to that is that the analysis I described can be done on a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 samples for each of your branches at the second level, and the same will happen. Continue as long as it is computationally feasible.
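As a sketch of that sampled variant: assuming the legacy code is wrapped as a function legacy(inputs) that returns the list of 170 outputs, and input_value_sets holds the allowed values for each input (both names are hypothetical), the rows fed to the scoring above could be drawn like this:

import random

def sample_rows(input_value_sets, legacy, output_idx, k=10000):
    # draw k random input vectors and record the chosen output for each one
    rows = []
    for _ in range(k):
        inputs = tuple(random.choice(values) for values in input_value_sets)
        rows.append((inputs, legacy(inputs)[output_idx]))
    return rows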

If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this, that, and the other, here is the output you always get." If there is a reasonably short set of nested ifs that gives the right output, you're probably going to find it. After that, you have to decide whether to verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.


A tricky approach to the validation: if you can find fuzzing software for your language, run it with the goal of teasing out every possible internal execution path for each bucket you find. If the fuzzer can't produce an answer different from the one the above approach predicts for a bucket, then you can probably trust it.

Upvotes: 3
