Reputation: 2094
I'm new to Tensorflow and machine learning.
My task is to predict the type of a given string input. Here's an example of the training data (with the output already one-hot encoded):
const training = [
  { x: '622-49-7314',     y: [1,0,0,0] }, // "ssn"
  { x: '1234 Elm Street', y: [0,1,0,0] }, // "street-address"
  { x: '(419) 555-5555',  y: [0,0,1,0] }, // "phone-number"
  { x: 'Jane Doe',        y: [0,0,0,1] }, // "full-name"
  { x: 'José García',     y: [0,0,0,1] }, // "full-name"
  // ... and millions more examples...
];
My first problem is how to encode the input, since it's not a typical text-dictionary problem (a sequence of words) but rather a variable-length sequence of characters.
I've tried 3 encoding approaches for the input string:
Encoding 1, standard text embeddings:
// presumably the universal-sentence-encoder model
const use = require('@tensorflow-models/universal-sentence-encoder');

async function encodeData(data) {
  const sentences = data.map(str => str.toLowerCase());
  const model = await use.load();
  const embeddings = await model.embed(sentences); // 512-dimensional sentence embeddings
  return embeddings;
}
Encoding 2, padded Unicode buffers and normalized exponential (softmax):
function encodeStr(str, pad = 512) {
  // pad the string, read it as a UTF-16 byte buffer, then softmax-normalize
  let arr = Array.from(
    new Int32Array(Buffer.from(str.padEnd(pad, '\0'), 'utf16le'))
  );
  const sum = arr.reduce((t, v) => t + Math.exp(v), 0);
  arr = arr.map(el => Math.exp(el) / sum);
  return arr;
}
Encoding 3, a locality-sensitive hash (Nilsimsa), whose 64-character hex digest is broken down into a vector of 32 byte values and softmax-normalized:
const { Nilsimsa } = require('nilsimsa');
function encodeHash(str) {
  // 64-char hex digest, split into 32 byte values, then softmax-normalized
  const hash = new Nilsimsa(str).digest('hex'),
    vals = hash.split(/(?<=^(?:.{2})+)(?!$)/).map(el => parseInt(el, 16));
  const sum = vals.reduce((t, v) => t + Math.exp(v), 0),
    normArr = vals.map(el => Math.exp(el) / sum);
  return normArr;
}
Then I used a simple model:
const tf = require('@tensorflow/tfjs'); // or '@tensorflow/tfjs-node'

const inputSz = 512; // or 128 for encodeStr, or 32 for encodeHash
const outputSz = 4; // [0,0,0,0] - the size of the one-hot encoding (potentially could be >1000)

const model = tf.sequential();
model.add(
  tf.layers.dense({
    inputShape: [inputSz],
    activation: 'softmax',
    units: outputSz
  })
);
model.add(
  tf.layers.dense({
    inputShape: [outputSz],
    activation: 'softmax',
    units: outputSz
  })
);
model.add(
  tf.layers.dense({
    inputShape: [outputSz],
    activation: 'softmax',
    units: outputSz
  })
);
model.compile({
  loss: 'meanSquaredError',
  optimizer: tf.train.adam(0.06)
});
Which is trained as such:
const trainingTensor = tf.tensor2d(data.map(_ => encodeInput(_.input)));
const [encodedOut, outputIndex, outSz] = encodeOutput(data.map(_ => _.output));
const outputData = tf.tensor2d(encodedOut);
const history = await model.fit(trainingTensor, outputData, { epochs: 50 });
But the results are all very poor, averaging loss = 0.165. I've tried different configs using the approaches above, i.e. "softmax" and "sigmoid" activations and more or fewer dense layers, but I just can't figure it out.
Any help or some direction here would be appreciated as I can't find good examples to base my solution on.
Upvotes: 1
Views: 503
Reputation: 18371
About the model
The softmax activation returns a probability (a value between 0 and 1) and is mostly used as the activation of the last layer in a classification problem. The relu activation can be used for the other layers instead. Additionally, for the loss function, categoricalCrossentropy is better suited than meanSquaredError.
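As a minimal sketch of that shape of model (the hidden-layer sizes and the learning rate here are placeholder assumptions; the 512 input size matches the sentence-embedding encoding from the question):
const tf = require('@tensorflow/tfjs');

const model = tf.sequential();
// relu in the hidden layers, softmax only on the last (classification) layer
model.add(tf.layers.dense({ inputShape: [512], units: 64, activation: 'relu' }));
model.add(tf.layers.dense({ units: 32, activation: 'relu' }));
model.add(tf.layers.dense({ units: 4, activation: 'softmax' }));

model.compile({
  loss: 'categoricalCrossentropy',
  optimizer: tf.train.adam(0.001),
  metrics: ['accuracy']
});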
An LSTM and/or a bidirectional LSTM can be added to the model to take the context of the data into account. If they are used, they should be the first layers of the model so as not to break the context of the data before passing it on to the dense layers.
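A rough sketch of what that could look like at the character level, assuming each string is first converted to a fixed-length array of character indices (maxLen and vocabSize are assumptions, not part of the question):
const tf = require('@tensorflow/tfjs');

const maxLen = 32;     // assumed padded/truncated string length
const vocabSize = 256; // assumed character-vocabulary size (e.g. raw char codes)

const charModel = tf.sequential();
// embed each character index, then let the bidirectional LSTM read the sequence
charModel.add(tf.layers.embedding({ inputDim: vocabSize, outputDim: 16, inputLength: maxLen }));
charModel.add(tf.layers.bidirectional({ layer: tf.layers.lstm({ units: 32 }) }));
// dense layers only after the context layer
charModel.add(tf.layers.dense({ units: 4, activation: 'softmax' }));

charModel.compile({
  loss: 'categoricalCrossentropy',
  optimizer: tf.train.adam(0.001)
});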
About the encoding
Since Nilsimsa is an algorithmic technique that hashes similar input items into the same "buckets" with high probability, it can also be used for clustering and text classification, though I haven't used it myself.
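To get a feel for that property, one can count how many digest bits differ between two inputs; this reuses the Nilsimsa class from the question, and the bit-difference helper below is only an illustrative assumption:
const { Nilsimsa } = require('nilsimsa');

// hex digest -> string of bits
function digestBits(str) {
  return new Nilsimsa(str).digest('hex')
    .split('')
    .map(c => parseInt(c, 16).toString(2).padStart(4, '0'))
    .join('');
}

// number of differing bits; lower usually means more similar inputs
function bitDiff(a, b) {
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff++;
  }
  return diff;
}

// two phone numbers should usually differ in fewer bits than a phone number vs a name
console.log(bitDiff(digestBits('(419) 555-5555'), digestBits('(419) 555-1234')));
console.log(bitDiff(digestBits('(419) 555-5555'), digestBits('Jane Doe')));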
The first encoding (the sentence embeddings) tries to keep the distance between words when creating tokens from the sentence.
Encoding the data as binary is less common in NLP. However, in this case, since the classification essentially has to figure out how many digits there are in the text to find the label, the binary encoding can create tensors where the Euclidean distance between inputs of different labels is high.
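A small sketch of that distance argument, using raw character codes scaled to [0, 1] instead of the softmax from the question (the pad length and the /255 scaling are assumptions):
const tf = require('@tensorflow/tfjs');

function encodeCodes(str, pad = 32) {
  return Array.from(str.padEnd(pad, '\0'))
    .slice(0, pad)
    .map(ch => ch.charCodeAt(0) / 255); // scale instead of softmax
}

const ssn = tf.tensor1d(encodeCodes('622-49-7314'));
const name = tf.tensor1d(encodeCodes('Jane Doe'));
const otherSsn = tf.tensor1d(encodeCodes('123-45-6789'));

// Euclidean distances: inputs with different labels should tend to be further apart
ssn.sub(name).norm().print();
ssn.sub(otherSsn).norm().print();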
Last but not least, another thing to compare between the encodings would be the time taken to create the tensors from the input string.
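Something as simple as console.time can be used for that comparison; this assumes the encodeData, encodeStr and encodeHash functions from the question are in scope, and that samples is an array of input strings:
async function timeEncodings(samples) {
  console.time('use-embeddings');
  await encodeData(samples);           // encoding 1
  console.timeEnd('use-embeddings');

  console.time('char-codes');
  samples.forEach(s => encodeStr(s));  // encoding 2
  console.timeEnd('char-codes');

  console.time('nilsimsa');
  samples.forEach(s => encodeHash(s)); // encoding 3
  console.timeEnd('nilsimsa');
}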
Upvotes: 1