Mithrand1r

Reputation: 2353

NLP for multi-feature data set using TensorFlow

I am just a beginner in this subject. I have tested some neural networks for image recognition, as well as NLP for sequence classification.

This second topic is the one that interests me. Using

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
  'some test sentence',
  'and the second sentence'
]
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sentences = tokenizer.texts_to_sequences(sentences)

will result, for each sentence, in an array of size [n, 1], where n is the number of words in that sentence. And, assuming I have implemented padding correctly, each training example in the set will be of size [n, 1], where n is the maximum sentence length.
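For reference, the padding step might look roughly like this (a minimal sketch using Keras' pad_sequences; max_sentence_length is just an assumed value):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad every tokenized sentence to the same length so the training set
# becomes a single [num_sentences, max_sentence_length] array.
padded = pad_sequences(sentences, maxlen=max_sentence_length, padding='post')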

That prepared training set I can then pass into Keras model.fit.

But what about when I have multiple features in my data set? Let's say I would like to build an event prioritization algorithm, and my data structure would look like:

[event_description, event_category, event_location, label]

Trying to tokenize such an array would result in an [n, m] matrix, where n is the maximum sentence length and m is the number of features.

How should I prepare such a dataset so that a model can be trained on it?

Would this approach be OK?

# Go through the training set to collect each feature into its own array
training_sentence = []
training_category = []
training_location = []
training_labels = []
for data in dataset:
  training_sentence.append(data['event_description'])
  training_category.append(data['event_category'])
  training_location.append(data['event_location'])
  training_labels.append(data['label'])

# Tokenize each array of raw text
tokenizer.fit_on_texts(training_sentence)
tokenizer.fit_on_texts(training_category)
tokenizer.fit_on_texts(training_location)
sequences = tokenizer.texts_to_sequences(training_sentence)
categories = tokenizer.texts_to_sequences(training_category)
locations = tokenizer.texts_to_sequences(training_location)

# Concatenate the feature arrays into one
training_example = numpy.concatenate([sequences, categories, locations])

# Omitting model definition; training the model
model.fit(training_example, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

I haven't tested it yet. I just want to make sure that I understand everything correctly and that my assumptions are right.

Is this a correct approach to building an NLP model with a NN?

Upvotes: 1

Views: 1253

Answers (1)

mcskinner

Reputation: 2748

I know of two common ways to manage multiple input sequences, and your approach lands somewhere between them.


One approach is to design a multi-input model with each of your text columns as a different input. They can share the vocabulary and/or embedding layer later, but for now you still need a distinct input sub-model for each of description, category, etc.

Each of these becomes an input to the network, using the Model(inputs=[...], outputs=rest_of_nn) syntax. You will need to design rest_of_nn so it can take multiple inputs. This can be as simple as your current concatenation, or you could use additional layers to do the synthesis.

It could look something like this:

# Imports assumed for this sketch.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import backend as K

# Build separate vocabularies. This could be shared.
desc_tokenizer = Tokenizer()
desc_tokenizer.fit_on_texts(training_sentence)
desc_vocab_size = len(desc_tokenizer.word_index) + 1  # + 1 so index 0 (padding) fits in the Embedding

categ_tokenizer = Tokenizer()
categ_tokenizer.fit_on_texts(training_category)
categ_vocab_size = len(categ_tokenizer.word_index) + 1

# Inputs.
desc = Input(shape=(desc_maxlen,))
categ = Input(shape=(categ_maxlen,))

# Input encodings, opting for different embeddings.
# Descriptions go through an LSTM as a demo of extra processing.
# The LSTM outputs categ_embed_size units so both encodings share the same
# feature dimension and can be concatenated along the time axis below.
embedded_desc = Embedding(desc_vocab_size, desc_embed_size, input_length=desc_maxlen)(desc)
encoded_desc = LSTM(categ_embed_size, return_sequences=True)(embedded_desc)
encoded_categ = Embedding(categ_vocab_size, categ_embed_size, input_length=categ_maxlen)(categ)

# Rest of the NN, which knows how to put everything together to get an output.
merged = concatenate([encoded_desc, encoded_categ], axis=1)
rest_of_nn = Dense(hidden_size, activation='relu')(merged)
rest_of_nn = Flatten()(rest_of_nn)
rest_of_nn = Dense(output_size, activation='softmax')(rest_of_nn)

# Create the model, assuming some sort of classification problem.
model = Model(inputs=[desc, categ], outputs=rest_of_nn)
model.compile(optimizer='adam', loss=K.categorical_crossentropy)

The second approach is to concatenate all of your data before encoding it, and then treat everything as a more standard single-sequence problem after that. It is common to use a unique token to separate or define the different fields, similar to BOS and EOS for the beginning and end of the sequence.

It would look something like this:

XXBOS XXDESC This event will be fun. XXCATEG leisure XXLOC Seattle, WA XXEOS

You can also do end tags for the fields like DESCXX, omit the BOS and EOS tokens, and generally mix and match however you want. You can even use this to combine some of your input sequences, but then use a multi-input model as above to merge the rest.
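A rough sketch of that preprocessing could look like this (the markers, variable names, and combined_maxlen are only illustrative, and I use lowercase markers so the Tokenizer's default lowercasing leaves them intact):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Join the fields of each example into one marked-up string,
# then treat the whole thing as a single sequence.
combined_texts = [
    'xxbos xxdesc %s xxcateg %s xxloc %s xxeos' % (desc, categ, loc)
    for desc, categ, loc in zip(training_sentence, training_category, training_location)
]

combined_tokenizer = Tokenizer(oov_token='<OOV>')
combined_tokenizer.fit_on_texts(combined_texts)
combined_sequences = combined_tokenizer.texts_to_sequences(combined_texts)
combined_padded = pad_sequences(combined_sequences, maxlen=combined_maxlen, padding='post')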


Speaking of mixing and matching, you also have the option to treat some of your inputs directly as an embedding. Low-cardinality fields like category and location do not need to be tokenized, and can be embedded directly without any need to split into tokens. That is, they don't need to be a sequence.
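For example, a field like category can be mapped straight to an integer id and embedded as a single vector (again just a sketch; the id mapping and names are illustrative):

import numpy as np
from tensorflow.keras.layers import Input, Embedding, Flatten

# Map each raw category string to an integer id, one id per example.
categ_to_id = {c: i for i, c in enumerate(sorted(set(training_category)))}
categ_ids = np.array([[categ_to_id[c]] for c in training_category])  # shape (N, 1)

# Embed the single id directly; no tokenizing, no sequence, no padding.
categ_input = Input(shape=(1,))
categ_encoded = Flatten()(Embedding(len(categ_to_id), categ_embed_size)(categ_input))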

If you are looking for a reference, I enjoyed this paper on Large Scale Product Categorization using Structured and Unstructured Attributes. It tests all or most of the ideas I have just outlined, on real data at scale.

Upvotes: 1
