Reputation: 183
I am new to CNTK. My environment is C# (unfortunately, I am not a Python or BrainScript programmer).
I am trying to use CNTK to design/train/test an LSTM on free text (NLP) to select an appropriate title (from a given set of titles, about 8,000 of them in my data).
I've used a separate program to map each word into a 100-element vector of real numbers (the 100 is a configurable value; my non-CNTK program, GloVe, can generate any width I select).
My raw input looks something like:
|label 17 |features the brown fox jumped over the ...
|label 19 |features there comes a time when all ...
...
Here '17' is shorthand for the 17th title and really stands for a one-hot representation: [0, 0, ..., 1, 0, 0, ...], where the '1' is in the 17th position.
Each input row is a sequence of words (separated by spaces). The typical length is a few hundred words, but some rows have thousands of words in them.
My issue is that I don't know how to insert a run-time transformation from my raw file format into something CNTK could use.
I can't assume in-memory data since in production we will be training on data that has millions of rows.
In each mini batch:
The '17' (in the example above) needs to be translated to [0, ..., 1, 0, ...].
Each word needs to be translated (via a lookup into C# Dictionary) into an array (of 100) real numbers.
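The two per-row translations above can be sketched as a lazy generator, so the file is never held in memory. This is a minimal sketch in Python rather than C#, purely for illustration; the function names (`parse_row`, `one_hot`, `stream_examples`), the zero-based label index, and the all-zero vector for unknown words are assumptions, not anything CNTK prescribes.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_row(line: str) -> Tuple[int, List[str]]:
    """Split '|label 17 |features the brown fox' into (17, ['the', 'brown', 'fox'])."""
    fields: Dict[str, List[str]] = {}
    for chunk in line.strip().split("|"):
        chunk = chunk.strip()
        if not chunk:
            continue
        name, _, rest = chunk.partition(" ")
        fields[name] = rest.split()
    return int(fields["label"][0]), fields["features"]


def one_hot(index: int, num_classes: int) -> List[float]:
    """Return a vector of zeros with a single 1.0 at `index` (assumed zero-based)."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def stream_examples(
    lines: Iterable[str],
    embeddings: Dict[str, List[float]],
    num_classes: int,
    dim: int,
) -> Iterator[Tuple[List[float], List[List[float]]]]:
    """Yield (one-hot label, list of word vectors) one row at a time."""
    unknown = [0.0] * dim  # assumption: out-of-vocabulary words map to zeros
    for line in lines:
        if not line.strip():
            continue
        label, words = parse_row(line)
        yield one_hot(label, num_classes), [embeddings.get(w, unknown) for w in words]
```

Because `lines` can be an open file object, rows are read and converted only as each minibatch asks for them.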
I realize this is the Embedding layer in CNTK's LSTM, but I cannot find any tutorial/example (especially in C#) of how to add a transformation layer that uses a non-one-hot embedding.
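One way to see why the dictionary lookup and an embedding layer are the same operation: multiplying a one-hot word vector by a matrix whose rows are the pretrained vectors simply selects one row. The sketch below uses a made-up 3-word vocabulary and 2-element vectors (hypothetical values, not real GloVe output) and plain Python lists instead of any CNTK API.

```python
from typing import List


def embed_via_one_hot(one_hot_row: List[float], matrix: List[List[float]]) -> List[float]:
    """Compute one_hot_row x matrix, i.e. what a linear embedding layer does."""
    dim = len(matrix[0])
    return [
        sum(one_hot_row[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(dim)
    ]


# Hypothetical vocabulary and embedding matrix; row i is the vector of word i.
vocab = ["the", "fox", "time"]
matrix = [[0.5, -1.0], [2.0, 0.25], [-3.0, 4.0]]
lookup = dict(zip(vocab, matrix))  # the dictionary form of the same data

for i, word in enumerate(vocab):
    one_hot_row = [1.0 if k == i else 0.0 for k in range(len(vocab))]
    # The matrix product and the dictionary lookup give the same vector.
    assert embed_via_one_hot(one_hot_row, matrix) == lookup[word]
```

So feeding word indices (one-hot or sparse) into an embedding layer initialized from the pretrained vectors should be equivalent to doing the lookup beforehand, assuming the layer's weights are kept fixed during training.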
For what it's worth, my template for doing this in C# is LSTMSequenceClassifier.cs in the CNTK examples.
Link to CNTK example: https://github.com/Microsoft/CNTK/blob/master/Examples/TrainingCSharp/Common/LSTMSequenceClassifier.cs
Any help would be greatly appreciated. I've racked my brains on this for the past week!
Upvotes: 3
Views: 1412
Reputation: 33
I am new to this too, so I would rather use the most basic raw data format than reach for CNTK's more advanced features.
My approach would be to write the data like this:
0 |feature 43:1 |label 0 0 1 0
0 |feature 23:1
0 |feature 15:1
0 |feature 34:1
1 |feature 37:1 |label 0 0 0 1
1 |feature 67:1
1 |feature 69:1
1 |feature 12:1
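The number at the start of each line is the sequence ID: all lines with the same ID belong to one sequence, and each line carries one word as a sparse index:value pair.
In this sample there are 4 possible classes, so each label is a 4-element one-hot vector, given once on the first line of the sequence.
The feature index is the word's position in your vocabulary (the rows of your embedding matrix):
index 43 is "the"
index 23 is "fox"
and so on.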
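Converting the question's raw rows into the layout above can be done once, offline, as a streaming pass over the file. The following is a minimal sketch in Python, assuming a hypothetical `vocab` dictionary from word to index and an index of 0 reserved for unknown words; it writes the label in sparse `index:1` form, which is more practical than a dense vector when there are about 8,000 titles.

```python
from typing import Dict, Iterable, Iterator


def to_ctf_lines(
    rows: Iterable[str],
    vocab: Dict[str, int],
    unk_index: int = 0,  # assumption: index 0 is reserved for unknown words
) -> Iterator[str]:
    """Turn '|label 17 |features the brown fox' rows into one line per word."""
    for seq_id, row in enumerate(rows):
        label_part, _, feature_part = row.strip().partition("|features")
        label = int(label_part.replace("|label", "").strip())
        for pos, word in enumerate(feature_part.split()):
            line = f"{seq_id} |feature {vocab.get(word, unk_index)}:1"
            if pos == 0:
                # The label appears once, on the first line of the sequence.
                line += f" |label {label}:1"
            yield line


# Usage with a made-up vocabulary:
vocab = {"the": 43, "fox": 23}
for out in to_ctf_lines(["|label 2 |features the fox"], vocab):
    print(out)
```

Since both the input and the output are consumed line by line, the same pass works on files with millions of rows; the logic is small enough to port to C# directly.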
For reference, see https://cntk.ai/pythondocs/sequence.html
I know reading Python can be confusing at first, but you will get the hang of it.
The data files used in that tutorial are here:
https://github.com/Microsoft/CNTK/tree/master/Tests/EndToEndTests/Text/SequenceClassification/Data
Upvotes: 0