Reputation: 2094
I'm new to Tensorflow and machine learning.
My task is to predict the type of a given string input. Here's an example of the training data (with the output already one-hot encoded):
const training = [
  { x: '622-49-7314',     y: [1,0,0,0] }, // "ssn"
  { x: '1234 Elm Street', y: [0,1,0,0] }, // "street-address"
  { x: '(419) 555-5555',  y: [0,0,1,0] }, // "phone-number"
  { x: 'Jane Doe',        y: [0,0,0,1] }, // "full-name"
  { x: 'José García',     y: [0,0,0,1] }, // "full-name"
  // ... and millions more examples...
];
My first problem is how to encode the input, since it's not a typical text-dictionary problem (a sequence of words) but rather a variable-length sequence of characters.
I've tried 3 encoding approaches for the input string:
Encoding 1, standard text embeddings:
// presumably the universal-sentence-encoder model
const use = require('@tensorflow-models/universal-sentence-encoder');

async function encodeData(data) {
  const sentences = data.map(str => str.toLowerCase());
  const model = await use.load();
  const embeddings = await model.embed(sentences); // 512-dimensional sentence embeddings
  return embeddings;
}
Encoding 2, padded Unicode buffers and normalized exponential (softmax):
function encodeStr(str, pad = 512) {
  // pad the string, read it as a UTF-16 byte buffer, then softmax-normalize
  let arr = Array.from(
    new Int32Array(Buffer.from(str.padEnd(pad, '\0'), 'utf16le'))
  );
  const sum = arr.reduce((t, v) => t + Math.exp(v), 0);
  arr = arr.map(el => Math.exp(el) / sum);
  return arr;
}
Encoding 3, a locality-sensitive hash (Nilsimsa), whose 64-character hex digest is broken down into a vector of 32 byte values and softmax-normalized:
const { Nilsimsa } = require('nilsimsa');
function encodeHash(str) {
  // 64-char hex digest, split into 32 byte values, then softmax-normalized
  const hash = new Nilsimsa(str).digest('hex'),
    vals = hash.split(/(?<=^(?:.{2})+)(?!$)/).map(el => parseInt(el, 16));
  const sum = vals.reduce((t, v) => t + Math.exp(v), 0),
    normArr = vals.map(el => Math.exp(el) / sum);
  return normArr;
}
Then I used a simple model:
const tf = require('@tensorflow/tfjs'); // or '@tensorflow/tfjs-node'

const inputSz = 512; // or 128 for encodeStr, or 32 for encodeHash
const outputSz = 4; // [0,0,0,0] - the size of the one-hot encoding (potentially could be >1000)

const model = tf.sequential();
model.add(
  tf.layers.dense({
    inputShape: [inputSz],
    activation: 'softmax',
    units: outputSz
  })
);
model.add(
  tf.layers.dense({
    inputShape: [outputSz],
    activation: 'softmax',
    units: outputSz
  })
);
model.add(
  tf.layers.dense({
    inputShape: [outputSz],
    activation: 'softmax',
    units: outputSz
  })
);
model.compile({
  loss: 'meanSquaredError',
  optimizer: tf.train.adam(0.06)
});
Which is trained as such:
const trainingTensor = tf.tensor2d(data.map(_ => encodeInput(_.input)));
const [encodedOut, outputIndex, outSz] = encodeOutput(data.map(_ => _.output));
const outputData = tf.tensor2d(encodedOut);
const history = await model.fit(trainingTensor, outputData, { epochs: 50 });
But the results are all very poor, averaging loss = 0.165. I've tried different configs using the approaches above, i.e. "softmax" and "sigmoid" activations and more or fewer dense layers, but I just can't figure it out.
Any help or some direction here would be appreciated as I can't find good examples to base my solution on.
Upvotes: 1
Views: 503
Reputation: 18371
About the model
The softmax activation returns a probability (a value between 0 and 1) and is mostly used as the activation of the last layer in a classification problem. The relu activation can be used for the other layers instead. Additionally, for the loss function, categoricalCrossentropy is better suited than meanSquaredError.
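As a minimal sketch of that shape of model (the hidden-layer sizes and the learning rate here are placeholder assumptions; the 512 input size matches the sentence-embedding encoding from the question):
const tf = require('@tensorflow/tfjs');

const model = tf.sequential();
// relu in the hidden layers, softmax only on the last (classification) layer
model.add(tf.layers.dense({ inputShape: [512], units: 64, activation: 'relu' }));
model.add(tf.layers.dense({ units: 32, activation: 'relu' }));
model.add(tf.layers.dense({ units: 4, activation: 'softmax' }));

model.compile({
  loss: 'categoricalCrossentropy',
  optimizer: tf.train.adam(0.001),
  metrics: ['accuracy']
});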
An LSTM and/or a bidirectional LSTM can be added to the model to take the context of the data into account. If they are used, they should be the first layers of the model so as not to break the context of the data before passing it on to the dense layers.
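A rough sketch of what that could look like at the character level, assuming each string is first converted to a fixed-length array of character indices (maxLen and vocabSize are assumptions, not part of the question):
const tf = require('@tensorflow/tfjs');

const maxLen = 32;     // assumed padded/truncated string length
const vocabSize = 256; // assumed character-vocabulary size (e.g. raw char codes)

const charModel = tf.sequential();
// embed each character index, then let the bidirectional LSTM read the sequence
charModel.add(tf.layers.embedding({ inputDim: vocabSize, outputDim: 16, inputLength: maxLen }));
charModel.add(tf.layers.bidirectional({ layer: tf.layers.lstm({ units: 32 }) }));
// dense layers only after the context layer
charModel.add(tf.layers.dense({ units: 4, activation: 'softmax' }));

charModel.compile({
  loss: 'categoricalCrossentropy',
  optimizer: tf.train.adam(0.001)
});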
About the encoding
Since Nilsimsa is an algorithmic technique that hashes similar input items into the same "buckets" with high probability, it can also be used for clustering and text classification, though I haven't used it myself.
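To get a feel for that property, one can count how many digest bits differ between two inputs; this reuses the Nilsimsa class from the question, and the bit-difference helper below is only an illustrative assumption:
const { Nilsimsa } = require('nilsimsa');

// hex digest -> string of bits
function digestBits(str) {
  return new Nilsimsa(str).digest('hex')
    .split('')
    .map(c => parseInt(c, 16).toString(2).padStart(4, '0'))
    .join('');
}

// number of differing bits; lower usually means more similar inputs
function bitDiff(a, b) {
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff++;
  }
  return diff;
}

// two phone numbers should usually differ in fewer bits than a phone number vs a name
console.log(bitDiff(digestBits('(419) 555-5555'), digestBits('(419) 555-1234')));
console.log(bitDiff(digestBits('(419) 555-5555'), digestBits('Jane Doe')));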
The first encoding (the sentence embeddings) tries to keep the distance between words when creating tokens from the sentence.
Encoding the data as binary is less common in NLP. However, in this case, since the classification essentially has to figure out how many digits there are in the text to find the label, the binary encoding can create tensors where the Euclidean distance between inputs of different labels is high.
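A small sketch of that distance argument, using raw character codes scaled to [0, 1] instead of the softmax from the question (the pad length and the /255 scaling are assumptions):
const tf = require('@tensorflow/tfjs');

function encodeCodes(str, pad = 32) {
  return Array.from(str.padEnd(pad, '\0'))
    .slice(0, pad)
    .map(ch => ch.charCodeAt(0) / 255); // scale instead of softmax
}

const ssn = tf.tensor1d(encodeCodes('622-49-7314'));
const name = tf.tensor1d(encodeCodes('Jane Doe'));
const otherSsn = tf.tensor1d(encodeCodes('123-45-6789'));

// Euclidean distances: inputs with different labels should tend to be further apart
ssn.sub(name).norm().print();
ssn.sub(otherSsn).norm().print();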
Last but not least, another thing to compare between the encodings would be the time taken to create the tensors from the input string.
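Something as simple as console.time can be used for that comparison; this assumes the encodeData, encodeStr and encodeHash functions from the question are in scope, and that samples is an array of input strings:
async function timeEncodings(samples) {
  console.time('use-embeddings');
  await encodeData(samples);           // encoding 1
  console.timeEnd('use-embeddings');

  console.time('char-codes');
  samples.forEach(s => encodeStr(s));  // encoding 2
  console.timeEnd('char-codes');

  console.time('nilsimsa');
  samples.forEach(s => encodeHash(s)); // encoding 3
  console.timeEnd('nilsimsa');
}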
Upvotes: 1