connor449

Reputation: 1679

Best way to feed text documents that contain labeled utterances into a deep learning model

I am building a text classification model in TensorFlow (experimenting with different architectures, from BiLSTM to 1D convnet, etc.). My data is structured as follows:

1 corpus of documents

~100 documents, each an independent but contextually similar multi-party conversation transcription (time series).

~200 labeled utterances per document (same labeling convention for all documents).

In other words, it looks like this (label structure looks the same, but with one int per string):

data = [
       [
       'hello how are you',
       'i am good',
       'whats the weather today',
       ...,
       ],
       [
       'how long have you had that cough',
       'roughly 2 weeks',
       'anything else',
       ...,
       ],
       ...,
       ]
       

Right now, I am feeding my data into my models as a flat list of strings (data) and ints (labels) by flattening all documents. This works, but I wonder if it is the best way to handle my data. IIUC, using any kind of RNN means that my model 'remembers' the previous data. However, as each document contains a separate conversation, text from document 1 does not affect text from document 2, and so on. Intuitively, since each document is an independent conversation, I want the model to 'remember' what happened at the beginning of a conversation when it reaches the end of that conversation, but to 'forget' when moving to the next. Is this intuition correct?

What is the best practice in this scenario? Is there a way to feed in one document at a time (i.e., by setting the batch size to the document length)? Would this make a difference, or is a flat list the way to go?

Thanks.

Upvotes: 2

Views: 153

Answers (1)

Sam H.

Reputation: 4349

It seems you have a collection of dialogues, and want to classify each turn in the dialogue into some number of classes.

A similar, well-studied problem is Dialogue Act Classification. Dialogue act classification is the task of classifying an utterance with respect to the function it serves in a dialogue, i.e. the act the speaker is performing. Dialogue acts are a type of speech act (for Speech Act Theory, see Austin (1975) and Searle (1969)).

The paper ["Dialogue Act Sequence Labeling using Hierarchical encoder with CRF"](https://arxiv.org/abs/1709.04250) has code available on GitHub. It is academic code and not the clearest, and it is unclear which version of TensorFlow they use.

Re: batch size: they use `batchSize = 2` (line). The dialogues have variable-length utterances.
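To batch one conversation per row like that, you generally keep the nested structure and pad each document to a common number of utterances rather than flattening. Here is a minimal sketch of the labels side of that (pure numpy; `PAD_LABEL` is a sentinel I made up so padded positions can later be masked out of the loss, not something from the paper):

```python
import numpy as np

# Toy version of the structure from the question: each inner list is one
# conversation (document) with one label per utterance.
labels = [
    [0, 1, 2],
    [3, 1],
]

PAD_LABEL = -1  # hypothetical sentinel for padded positions

max_len = max(len(doc) for doc in labels)

# Pad every conversation to the same number of utterances so each
# document becomes one row of a (num_docs, max_utterances) batch.
padded_labels = np.full((len(labels), max_len), PAD_LABEL, dtype=np.int64)
for i, doc in enumerate(labels):
    padded_labels[i, :len(doc)] = doc
```

The same padding is applied to the (tokenized) utterances themselves; the point is that the conversation boundary survives as the batch dimension, so recurrent state is never carried across documents.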

I think you should read the paper, though; it has lots of relevant quotes, like:

> We propose a hierarchical recurrent encoder, where the first encoder operates at the utterance level, encoding each word in each utterance, and the second encoder operates at the conversation level, encoding each utterance in the conversation, based on the representations of the previous encoder. These two encoders make sure that the output of the second encoder capture the dependencies among utterances.
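As a rough illustration of that hierarchical idea, here is a minimal Keras sketch with made-up sizes (`VOCAB`, `MAX_UTTS`, `MAX_WORDS`, `N_CLASSES` are all hypothetical), not the paper's implementation, and omitting their CRF layer and padding masks:

```python
import tensorflow as tf

VOCAB, MAX_UTTS, MAX_WORDS, N_CLASSES = 1000, 200, 30, 10  # hypothetical sizes

# First encoder: operates at the utterance level, turning one utterance
# (a sequence of word ids) into a single vector.
utt_in = tf.keras.Input(shape=(MAX_WORDS,), dtype='int32')
x = tf.keras.layers.Embedding(VOCAB, 64)(utt_in)
utt_vec = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(x)
utt_encoder = tf.keras.Model(utt_in, utt_vec)

# Second encoder: operates at the conversation level, running over the
# sequence of utterance vectors. Because each batch row is one document,
# recurrent state is shared within a conversation but never leaks between
# conversations.
conv_in = tf.keras.Input(shape=(MAX_UTTS, MAX_WORDS), dtype='int32')
h = tf.keras.layers.TimeDistributed(utt_encoder)(conv_in)
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True))(h)
out = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(N_CLASSES, activation='softmax'))(h)

# One class prediction per utterance: output shape (batch, MAX_UTTS, N_CLASSES)
model = tf.keras.Model(conv_in, out)
```

This directly encodes the intuition from the question: the conversation-level LSTM can 'remember' the beginning of a dialogue at its end, but starts fresh on the next row of the batch.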

Upvotes: 1
