david marcus

Reputation: 183

How to: CNTK C# LSTM classifier of free text (NLP) using Word2Vec embeddings

I am new to CNTK. My environment is C# (unfortunately, I am not a Python or BrainScript programmer).

I am trying to use CNTK to design/train/test an LSTM on free text (NLP) to select an appropriate title (from a given set of titles, about 8,000 of them in my data).

I've used a separate program to map each word into a 100-element vector of real numbers (the 100 is a configurable value; my non-CNTK program, GloVe, can generate any width I select).

My raw input looks something like:

|label 17 |features the brown fox jumped over the ...
|label 19 |features there comes a time when all ...
...

Where '17' is shorthand for the 17th title and really stands for a one-hot representation: [0, 0, ..., 1, 0, 0, ...] where the '1' is in the 17th position.

Each input row is a sequence of words (separated by spaces). The typical length is a few hundred words, but some rows have thousands of words in them.

My issue is that I don't know how to insert a run-time transformation from my raw file format into something CNTK could use.

I can't assume in-memory data since in production we will be training on data that has millions of rows.

In each mini batch:

The '17' (in the example above) needs to be translated to [0, ..., 1, 0, ...].

Each word needs to be translated (via a lookup into C# Dictionary) into an array (of 100) real numbers.
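
One way to sketch the two translations above, without touching CNTK itself, is to pre-convert the raw file into CNTK Text Format (CTF) as a stream: one output line per word, all lines of a row sharing a sequence ID. The class and method names below (RawToCtf, ToCtfLines) are hypothetical, and three things are assumptions rather than anything stated in the question: the label is written in sparse form ("17:1", read as a zero-based index), words missing from the dictionary are skipped, and malformed rows are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class RawToCtf
{
    // Converts raw rows such as "|label 17 |features the brown fox" into
    // CTF lines: one line per word, all sharing the row's sequence ID.
    // The input is consumed lazily, so the whole file never sits in memory.
    public static IEnumerable<string> ToCtfLines(
        IEnumerable<string> rawRows,
        IReadOnlyDictionary<string, float[]> embeddings)
    {
        long sequenceId = 0;
        foreach (string row in rawRows)
        {
            // Locate the two fields of "|label 17 |features w1 w2 ...".
            int featuresAt = row.IndexOf("|features", StringComparison.Ordinal);
            if (!row.StartsWith("|label", StringComparison.Ordinal) || featuresAt < 0)
                continue; // assumption: malformed rows are skipped

            string label = row
                .Substring("|label".Length, featuresAt - "|label".Length)
                .Trim();
            string[] words = row
                .Substring(featuresAt + "|features".Length)
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            bool first = true;
            foreach (string word in words)
            {
                // Assumption: words without a vector are skipped.
                if (!embeddings.TryGetValue(word, out float[] vector))
                    continue;

                // Dense feature values, formatted culture-independently.
                string dense = string.Join(" ",
                    vector.Select(v => v.ToString("R", CultureInfo.InvariantCulture)));

                string line = sequenceId + " |features " + dense;
                if (first)
                {
                    // The label appears once per sequence, in sparse form.
                    line += " |label " + label + ":1";
                    first = false;
                }
                yield return line;
            }
            sequenceId++;
        }
    }
}
```

For a large file, the input could be File.ReadLines(path) (which reads lazily) and the output written line by line with a StreamWriter. The resulting file can then be read by a text-format minibatch source, as the linked LSTMSequenceClassifier.cs does for its own training file. With 100 dense values per word the features are already embedded, so the model would presumably feed them straight into the LSTM instead of going through the example's Embedding step.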

I realize this is the job of the Embedding layer in CNTK's LSTM, but I cannot find any tutorial/example (especially in C#) of how to add a transformation layer using a non-one-hot embedding.

For what it's worth, my template for doing this in C# is the LSTMSequenceClassifier.cs in the CNTK examples.

Link to CNTK example: https://github.com/Microsoft/CNTK/blob/master/Examples/TrainingCSharp/Common/LSTMSequenceClassifier.cs

Any help would be greatly appreciated. I've racked my brains on this for the past week!

Upvotes: 3

Views: 1412

Answers (1)

zero core

Reputation: 33

I am new to this too, but I would rather use the rawest data format than look into CNTK's more advanced features.

My approach for this would be:

0   |feature  43:1   |label 0 0 1 0  
0   |feature  23:1   
0   |feature  15:1   
0   |feature  34:1   
1   |feature  37:1   |label 0 0 0 1  
1   |feature  67:1   
1   |feature  69:1   
1   |feature  12:1   

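where the leading number is the sequence ID (all words of one row share it), the label is a one-hot vector over 4 possible classes, and each feature is the index of a word in your matrix: the 43rd word is "the", the 23rd is "fox", and so on.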
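
A minimal sketch of how lines in that layout could be produced, assuming a vocabulary that hands out word indices in order of first appearance. The names SparseCtf, IndexOf, OneHot and SequenceToCtf are made up for illustration; with a pretrained matrix, the index would instead be the word's row in that matrix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SparseCtf
{
    // Returns the index of a word, assigning the next free index
    // the first time the word is seen (assumption: zero-based).
    public static int IndexOf(Dictionary<string, int> vocabulary, string word)
    {
        if (!vocabulary.TryGetValue(word, out int index))
        {
            index = vocabulary.Count;
            vocabulary[word] = index;
        }
        return index;
    }

    // Dense one-hot vector as text: class 2 of 4 gives "0 0 1 0".
    public static string OneHot(int classIndex, int classCount)
    {
        return string.Join(" ",
            Enumerable.Range(0, classCount).Select(i => i == classIndex ? "1" : "0"));
    }

    // Emits one line per word; only the first line carries the label.
    public static List<string> SequenceToCtf(
        long sequenceId, string[] words, int classIndex, int classCount,
        Dictionary<string, int> vocabulary)
    {
        var lines = new List<string>();
        for (int i = 0; i < words.Length; i++)
        {
            string line = sequenceId + " |feature " + IndexOf(vocabulary, words[i]) + ":1";
            if (i == 0)
                line += " |label " + OneHot(classIndex, classCount);
            lines.Add(line);
        }
        return lines;
    }
}
```

A dense one-hot label is fine for 4 classes; with thousands of classes the sparse "index:1" form, the same one used for the features here, is probably the more practical choice.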

For reference, see https://cntk.ai/pythondocs/sequence.html

I know reading Python can be confusing at first, but you will get the hang of it.

https://github.com/Microsoft/CNTK/tree/master/Tests/EndToEndTests/Text/SequenceClassification/Data

contains the data files used in that example tutorial.

Upvotes: 0
