connor449

Reputation: 1679

Best way to feed text documents that contain labeled utterances into a deep learning model

I am building a text classification model in TensorFlow (experimenting with different architectures, from BiLSTM to 1D convnet, etc.). My data is structured as follows:

1 corpus of documents

~100 documents, each an independent but contextually similar multi-party conversation transcription (time series).

~200 labeled utterances per document (same labeling convention for all documents).

In other words, it looks like this (label structure looks the same, but with one int per string):

data = [
       [
       'hello how are you',
       'i am good',
       'whats the weather today',
       ...,
       ],
       [
       'how long have you had that cough',
       'roughly 2 weeks',
       'anything else',
       ...,
       ],
       ...,
       ]
       

Right now, I am feeding my data into my models as a flat list of strings (data) and ints (labels) by flattening all documents. This works, but I wonder if it is the best way to handle my data. IIUC, using any kind of RNN means that my model 'remembers' the previous data. However, as each document contains a separate conversation, text from document 1 does not affect text from document 2, and so on. Intuitively, since each document is an independent conversation, I want the model to 'remember' what happened at the beginning of a conversation when it reaches the end of that conversation, but to 'forget' when moving to the next. Is this intuition correct?

What is the best practice in this scenario? Is there a way to feed in one document at a time (i.e., by setting the batch size to the document length)? Would this make a difference, or is a flat list the way to go?

Thanks.

Upvotes: 2

Views: 153

Answers (1)

Sam H.

Reputation: 4349

It seems you have a collection of dialogues, and want to classify each turn in the dialogue into some number of classes.

A similar, well-studied problem is Dialogue Act Classification. Dialogue act classification is the task of classifying an utterance with respect to the function it serves in a dialogue, i.e. the act the speaker is performing. Dialogue acts are a type of speech act (for Speech Act Theory, see Austin (1975) and Searle (1969)).

The paper ["Dialogue Act Sequence Labeling using Hierarchical encoder with CRF"](https://arxiv.org/abs/1709.04250) has code available on GitHub. It is academic code and not the clearest, and it is unclear which version of TensorFlow they use.

Re: batch size: they use `batchSize = 2` (line). The dialogues have variable-length utterances.
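To batch one conversation per row like that, you generally keep the nested structure and pad each document to a common number of utterances rather than flattening. Here is a minimal sketch of the labels side of that (pure numpy; `PAD_LABEL` is a sentinel I made up so padded positions can later be masked out of the loss, not something from the paper):

```python
import numpy as np

# Toy version of the structure from the question: each inner list is one
# conversation (document) with one label per utterance.
labels = [
    [0, 1, 2],
    [3, 1],
]

PAD_LABEL = -1  # hypothetical sentinel for padded positions

max_len = max(len(doc) for doc in labels)

# Pad every conversation to the same number of utterances so each
# document becomes one row of a (num_docs, max_utterances) batch.
padded_labels = np.full((len(labels), max_len), PAD_LABEL, dtype=np.int64)
for i, doc in enumerate(labels):
    padded_labels[i, :len(doc)] = doc
```

The same padding is applied to the (tokenized) utterances themselves; the point is that the conversation boundary survives as the batch dimension, so recurrent state is never carried across documents.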

I think you should read the paper, though; it has lots of relevant quotes, like:

> We propose a hierarchical recurrent encoder, where the first encoder operates at the utterance level, encoding each word in each utterance, and the second encoder operates at the conversation level, encoding each utterance in the conversation, based on the representations of the previous encoder. These two encoders make sure that the output of the second encoder capture the dependencies among utterances.
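As a rough illustration of that hierarchical idea, here is a minimal Keras sketch with made-up sizes (`VOCAB`, `MAX_UTTS`, `MAX_WORDS`, `N_CLASSES` are all hypothetical), not the paper's implementation, and omitting their CRF layer and padding masks:

```python
import tensorflow as tf

VOCAB, MAX_UTTS, MAX_WORDS, N_CLASSES = 1000, 200, 30, 10  # hypothetical sizes

# First encoder: operates at the utterance level, turning one utterance
# (a sequence of word ids) into a single vector.
utt_in = tf.keras.Input(shape=(MAX_WORDS,), dtype='int32')
x = tf.keras.layers.Embedding(VOCAB, 64)(utt_in)
utt_vec = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
utt_encoder = tf.keras.Model(utt_in, utt_vec)

# Second encoder: operates at the conversation level, running over the
# sequence of utterance vectors. Because each batch row is one document,
# recurrent state is shared within a conversation but never leaks between
# conversations.
conv_in = tf.keras.Input(shape=(MAX_UTTS, MAX_WORDS), dtype='int32')
h = tf.keras.layers.TimeDistributed(utt_encoder)(conv_in)
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True))(h)
out = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(N_CLASSES, activation='softmax'))(h)

# One class prediction per utterance: output shape (batch, MAX_UTTS, N_CLASSES)
model = tf.keras.Model(conv_in, out)
```

This directly encodes the intuition from the question: the conversation-level LSTM can 'remember' the beginning of a dialogue at its end, but starts fresh on the next row of the batch.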

Upvotes: 1
