Reputation: 1033
According to the docs, the TensorFlow Embedding layer has a fixed input_dim, i.e., vocabulary size.
When we train a DNN model in a streaming fashion (online learning), the number of unique features, i.e., input_dim, is unknown beforehand. It can grow over time as new data comes in. Thus, we cannot declare the Embedding layer with a fixed input_dim. How should we handle the embedding in this case?
Thanks in advance!
Upvotes: 1
Views: 805
Reputation: 3773
You are correct: when you declare an embedding layer like so
tf.keras.layers.Embedding(input_dim, output_dim)
you have to pass static values for the input and output dims. Under the covers, the embedding is implemented as a large matrix of size input_dim x output_dim, and tf.keras.backend.gather
then pulls out the rows for the input indices that you pass.
This doesn't solve your problem by itself, but it gives us the insight we need to fix it.
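To make the lookup mechanism concrete, here is a minimal NumPy sketch (the matrix size and token ids are made-up examples) showing that an embedding lookup is just row selection from the weight matrix:

```python
import numpy as np

# A hypothetical embedding matrix: input_dim=300 rows, output_dim=20 columns.
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((300, 20))

# Looking up token ids is just gathering rows -- conceptually what
# tf.keras.backend.gather does inside the Embedding layer.
token_ids = np.array([0, 42, 299])
vectors = embedding_matrix[token_ids]  # shape (3, 20)
```

Each input index simply picks one row, which is why only input_dim (the number of rows) matters for vocabulary growth.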
Fix 1: Over-allocate. It is as simple as that: allocate 20%, 50%, or 200% more input_dim than you need today. Estimate how much you will need before you plan to deploy another model. If that is next month, then allocate enough space to get you there (and a little more as a buffer).
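A sketch of Fix 1, assuming a current vocabulary of 10,000 features and a 50% headroom factor (both numbers are illustrative, not prescriptive):

```python
import tensorflow as tf

current_vocab_size = 10_000   # unique features seen so far (assumed figure)
buffer_factor = 1.5           # 50% headroom for ids that arrive later
input_dim = int(current_vocab_size * buffer_factor)  # 15_000 rows

embedding = tf.keras.layers.Embedding(input_dim=input_dim, output_dim=20)

# Ids up to input_dim - 1 are valid even though they haven't been seen yet;
# their rows just stay at their random initialization until trained.
out = embedding(tf.constant([[0, 9_999, 14_999]]))
```

New feature ids can then be assigned into the unused rows as they appear, with no change to the model.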
Fix 2: Eventually Fix 1 will run out. Create a new model that reuses all the weights from your previously trained model except those in the Embedding layer. Create a new embedding layer of the size you want next (and add some buffer to that). Remembering that the rows correspond to the inputs and the columns to the outputs, we simply copy the previous embedding matrix into the top of the new embedding matrix.
If our old matrix had input_dim=300 and output_dim=20, that is a 300x20 matrix. If our new matrix has input_dim=500 and output_dim=20, that is a 500x20 matrix. Copy the first 300 rows of the previous matrix into the newly initialized 500x20 matrix.
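A sketch of that copy using the 300x20 and 500x20 sizes from the example above (here the "old" layer stands in for the trained model's embedding):

```python
import numpy as np
import tensorflow as tf

# Old (trained) embedding: input_dim=300, output_dim=20.
old_embedding = tf.keras.layers.Embedding(300, 20)
old_embedding(tf.constant([0]))                 # call once to create the weights
old_matrix = old_embedding.get_weights()[0]     # shape (300, 20)

# New, larger embedding: input_dim=500, same output_dim.
new_embedding = tf.keras.layers.Embedding(500, 20)
new_embedding(tf.constant([0]))
new_matrix = new_embedding.get_weights()[0]     # shape (500, 20)

# Copy the 300 trained rows into the top of the new matrix;
# rows 300..499 keep their fresh random initialization.
new_matrix[:300] = old_matrix
new_embedding.set_weights([new_matrix])
```

The new rows train from scratch while the existing features keep their learned vectors.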
Upvotes: 3