powerPixie

BERT output should be text_A + text_B = some classification, but it's doing text_A = some classification and text_b = some classification

I am using code adapted from the notebook Predicting Movie Reviews with BERT on TF Hub.ipynb. I want to pass two sentences to the model together and get a single classification for the pair.

Some earlier code from "Predicting Movie Reviews with BERT on TF Hub.ipynb" is needed to run the code I am posting. I used small_bert_bert_uncased_L-4_H-768_A-12_1 as the model.

I think I have taken a small step towards the solution, thanks to Matthew Viglione.

abstracts = []

abstracts.append("Infants understand that people pursue goals, but how do they learn which goals people prefer? We tested whether infants solve this problem by inverting a mental model of action planning, trading off the costs of acting against the rewards actions bring. After seeing an agent attain two goals equally often at varying costs, infants expected the agent to prefer the goal it attained through costlier actions. These expectations held across three experiments that conveyed cost through different physical path features (height, width, and incline angle), suggesting that an abstract variable—such as “force,” “work,” or “effort”—supported infants’ inferences. We modeled infants’ expectations as Bayesian inferences over utility-theoretic calculations, providing a bridge to recent quantitative accounts of action understanding in older children and adults.")
abstracts.append("Our understanding of how diseases spread has greatly benefited from advances in network modeling. However, despite of its importance for disease contagion, the directionality of edges has rarely been taken into account. On the other hand, the introduction of the multilayer framework has made it possible to deal with more complex scenarios in epidemiology such as the interaction between different pathogens or multiple strains of the same disease. In this work, we study in depth the dynamics of disease spreading in directed multilayer networks. Using the generating function approach and numerical simulations of a stochastic susceptible-infected-susceptible model, we calculate the epidemic threshold of synthetic and real-world multilayer systems and show that it is mainly determined by the directionality of the links connecting different layers, regardless of the degree distribution chosen for the layers. Our findings are of utmost interest given the ubiquitous presence of directed multilayer networks and the widespread use of disease-like spreading processes in a broad range of phenomena such as diffusion processes in social and transportation systems.")

def getPrediction(in_sentences):
  labels = ["Negative", "Positive"]
  #input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label
  input_examples = [run_classifier.InputExample(guid="", text_a = in_sentences[0], text_b = in_sentences[1], label = 1)] #
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  
  return[(sentence,prediction['probabilities'],labels[prediction['labels']]) for sentence, prediction in [list[x] for x in zip(in_sentences,predictions)]]

The error states:

H:\Users\XXXXX\Anaconda3\envs\tfm\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in _extract_batch_length(self, preds_evaluated)
   1033     for key, value in six.iteritems(preds_evaluated):
   1034       batch_length = batch_length or value.shape[0]
-> 1035       if value.shape[0] != batch_length:
   1036         raise ValueError('Batch length of predictions should be same. %s has '
   1037                          'different batch length than others.' % key)

IndexError: tuple index out of range

I changed the code a little bit and found an interesting output.

def getPrediction(in_sentences):
  labels = ['N', 'S']
  #input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label
  input_examples = [run_classifier.InputExample(guid="", text_a = in_sentences[0], text_b = in_sentences[1], label = 0)] #
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  
  preds = estimator.predict(predict_input_fn,predict_keys=labels)

  return [labels for pred in preds]

When I run:

result = getPrediction(abstracts)

The error is:

H:\Users\XXXXXXX\Anaconda3\envs\tfm\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in _extract_keys(self, predictions, predict_keys)
   1052     if not predictions:
   1053       raise ValueError('Expected to run at least one output from %s, '
-> 1054                        'provided %s.' % (existing_keys, predict_keys))
   1055     return predictions
   1056 

ValueError: Expected to run at least one output from dict_keys(['probabilities', 'labels']), provided ['N', 'S'].
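
The message suggests that predict_keys selects among the names of the model's outputs ('probabilities' and 'labels' here), not among class labels such as 'N' and 'S'. A minimal sketch of that kind of filtering is shown below. It is a simplified stand-in written for illustration, not the TensorFlow code; the function name extract_keys and the sample outputs are assumptions.

```python
def extract_keys(predictions, predict_keys=None):
    """Simplified stand-in for filtering model outputs by key."""
    if predict_keys is None:
        return predictions
    existing_keys = predictions.keys()
    # Keep only the outputs whose names were requested.
    selected = {k: v for k, v in predictions.items() if k in predict_keys}
    if not selected:
        raise ValueError(
            "Expected to run at least one output from %s, provided %s."
            % (existing_keys, predict_keys)
        )
    return selected


# Hypothetical model outputs for one example.
outputs = {"probabilities": [-0.1, -2.3], "labels": 0}

# Asking for an existing output name works.
only_labels = extract_keys(outputs, predict_keys=["labels"])
```

Passing predict_keys=["N", "S"] to this sketch raises the same kind of ValueError, because neither name matches an output key.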


Answers (2)

powerPixie

In the end, the problem is that my input is one single example, and estimator.predict does not handle that with its default settings. I tried using a batch made of one single example, as you can see:

def getPrediction(in_sentences):
  labels = ["Negative", "Positive"]
  #input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label
  input_examples = [run_classifier.InputExample(guid="", text_a = in_sentences[0], text_b = in_sentences[1], label = 1)] #
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  
  return[(sentence,prediction['probabilities'],labels[prediction['labels']]) for sentence, prediction in [list[x] for x in zip(in_sentences,predictions)]]

abstract = "Infants understand that people pursue goals, but how do they learn which goals people prefer? We tested whether infants solve this problem by inverting a mental model of action planning, trading off the costs of acting against the rewards actions bring. After seeing an agent attain two goals equally often at varying costs, infants expected the agent to prefer the goal it attained through costlier actions. These expectations held across three experiments that conveyed cost through different physical path features (height, width, and incline angle), suggesting that an abstract variable—such as “force,” “work,” or “effort”—supported infants’ inferences. We modeled infants’ expectations as Bayesian inferences over utility-theoretic calculations, providing a bridge to recent quantitative accounts of action understanding in older children and adults."

result = getPrediction(abstract)

This raises the error:

H:\Users\XXXXX\Anaconda3\envs\tfm\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py in _extract_batch_length(self, preds_evaluated)
   1033     for key, value in six.iteritems(preds_evaluated):
   1034       batch_length = batch_length or value.shape[0]
-> 1035       if value.shape[0] != batch_length:
   1036         raise ValueError('Batch length of predictions should be same. %s has '
   1037                          'different batch length than others.' % key)

IndexError: tuple index out of range

The solution, in my case (sentence pair classification), is:

abstracts = []

abstracts.append("Infants understand that people pursue goals, but how do they learn which goals people prefer? We tested whether infants solve this problem by inverting a mental model of action planning, trading off the costs of acting against the rewards actions bring. After seeing an agent attain two goals equally often at varying costs, infants expected the agent to prefer the goal it attained through costlier actions. These expectations held across three experiments that conveyed cost through different physical path features (height, width, and incline angle), suggesting that an abstract variable—such as “force,” “work,” or “effort”—supported infants’ inferences. We modeled infants’ expectations as Bayesian inferences over utility-theoretic calculations, providing a bridge to recent quantitative accounts of action understanding in older children and adults.")
abstracts.append("The mammalian immune system implements a remarkably effective set of mechanisms for fighting pathogens. Its main components are haematopoietic immune cells, including myeloid cells that control innate immunity, and lymphoid cells that constitute adaptive immunity. However, immune functions are not unique to haematopoietic cells, and many other cell types display basic mechanisms of pathogen defence. To advance our understanding of immunology outside the haematopoietic system, here we systematically investigate the regulation of immune genes in the three major types of structural cells: epithelium, endothelium and fibroblasts. We characterize these cell types across twelve organs in mice, using cellular phenotyping, transcriptome sequencing, chromatin accessibility profiling and epigenome mapping. This comprehensive dataset revealed complex immune gene activity and regulation in structural cells. The observed patterns were highly organ-specific and seem to modulate the extensive interactions between structural cells and haematopoietic immune cells. Moreover, we identified an epigenetically encoded immune potential in structural cells under tissue homeostasis, which was triggered in response to systemic viral infection. This study highlights the prevalence and organ-specific complexity of immune gene activity in non-haematopoietic structural cells, and it provides a high-resolution, multi-omics atlas of the epigenetic and transcriptional networks that regulate structural cells in the mouse.")

def getPrediction(in_sentences):
  labels = ["Not_Similar", "Similar"]
  input_examples = [run_classifier.InputExample(guid="", text_a = in_sentences[0], text_b = in_sentences[1], label = 0)] # here, "" is just a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn,yield_single_examples=False)
  return [(prediction['probabilities'], labels[prediction['labels']]) for prediction in predictions]
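
The traceback points at value.shape[0], which can only raise "tuple index out of range" when a prediction's shape is an empty tuple, that is, when the value is a scalar with no batch dimension. The sketch below reproduces that check with plain tuples. It is a simplified stand-in, not the TensorFlow implementation; the function name and the example shapes are assumptions. With yield_single_examples=False the estimator appears to hand back each batch as-is instead of splitting it per example, which would explain why the call above works.

```python
def extract_batch_length(shapes):
    """Simplified stand-in for the batch-length check in the traceback.

    `shapes` maps each prediction key to its array shape (a tuple).
    """
    batch_length = None
    for key, shape in shapes.items():
        # shape[0] raises IndexError when shape is the empty tuple ().
        batch_length = batch_length or shape[0]
        if shape[0] != batch_length:
            raise ValueError("%s has different batch length than others." % key)
    return batch_length


# Two examples in the batch: every output has a leading batch dimension.
two_examples = {"probabilities": (2, 2), "labels": (2,)}

# Hypothetical single-example case where 'labels' came back as a scalar.
one_example = {"probabilities": (1, 2), "labels": ()}
```

Calling extract_batch_length(one_example) fails on the comparison line, the same line the arrow marks in the traceback.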


Salvatore

You are only using the first two characters of each sentence.

The last 4 cells in the notebook in this BERT repo show how to use the classifier to make predictions on sentences:

input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences]  

Full function:

def getPrediction(in_sentences):
  labels = ["Negative", "Positive"]
  input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label
  input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities'], labels[prediction['labels']]) for sentence, prediction in zip(in_sentences, predictions)]

In your code, you are using input_examples = [run_classifier.InputExample(guid="", text_a = x[0], text_b = x[1], label = 0) for x in in_sentences]. The for x in in_sentences loop already yields one sentence at a time, so x[0] and x[1] grab only the first and second characters of each sentence.

>>> sentences = ['Just a little, incomplete sentence.', 'Another little one.']
>>> [(x[0], x[1]) for x in sentences]
[('J', 'u'), ('A', 'n')]

vs.

>>> for x in sentences:
...  print(x)
...
Just a little, incomplete sentence.
Another little one.  

Fixing that first line should get you a lot closer.
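
If the goal is one example per pair of sentences, one option is to iterate over (text_a, text_b) tuples so that each field receives a whole sentence. The sketch below uses a namedtuple as a stand-in for run_classifier.InputExample so it runs without BERT installed; make_pair_examples is a hypothetical helper, not part of the notebook.

```python
from collections import namedtuple

# Stand-in for run_classifier.InputExample so the sketch runs without BERT.
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])


def make_pair_examples(sentence_pairs):
    """Build one example per (text_a, text_b) pair of whole sentences."""
    return [
        InputExample(guid="", text_a=a, text_b=b, label=0)  # label is a dummy
        for a, b in sentence_pairs
    ]


pairs = [("Just a little, incomplete sentence.", "Another little one.")]
examples = make_pair_examples(pairs)
```

Each example now carries both full sentences, so a pair yields one prediction instead of one prediction per sentence.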

As for:

"I realize the output is wrong, because it is not considering text_A and text_B being analyzed together = some classification"

Sentence pair classification

See 'BERT for Humans Classification Tutorial -> 5.2 Sentence Pair Classification Tasks'.

It works like this:

[Figure: BERT sentence pair matching]

Make sure you are using a preprocessor to make that text into something BERT understands. In the case of sentence pair classification, there need to be [CLS] and [SEP] tokens in the appropriate places.
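
As a rough illustration of that layout, the sketch below places [CLS] and [SEP] around two token lists and assigns segment ids. Whitespace splitting stands in for a real WordPiece tokenizer, and build_pair_tokens is a hypothetical helper. In the notebook's pipeline, run_classifier.convert_examples_to_features should, as far as I know, build this layout for you when text_b is set.

```python
def build_pair_tokens(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Whitespace splitting stands in for a real WordPiece tokenizer.
tokens, segment_ids = build_pair_tokens("how are you".split(), "fine thanks".split())
```

Here tokens has eight entries, and segment_ids marks the first five as segment 0 and the last three as segment 1.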

See Preprocessing Text for BERT to understand how to format the sentences, and see this TensorFlow implementation for a complete example. Below is their example of how to tokenize a question-answer input. The process is similar for sentence pairs, since question-answer is a special case of the broader sentence-pair task.

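# Imports needed by the helpers below
from math import ceil, floor

import numpy as np
from tqdm import tqdm

def _get_masks(tokens, max_seq_length):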
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))

def _get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    first_sep = True
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            if first_sep:
                first_sep = False 
            else:
                current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))

def _get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids

def _trim_input(title, question, answer, max_sequence_length, 
                t_max_len=30, q_max_len=239, a_max_len=239):

    t = tokenizer.tokenize(title)
    q = tokenizer.tokenize(question)
    a = tokenizer.tokenize(answer)
    
    t_len = len(t)
    q_len = len(q)
    a_len = len(a)

    if (t_len+q_len+a_len+4) > max_sequence_length:
        
        if t_max_len > t_len:
            t_new_len = t_len
            a_max_len = a_max_len + floor((t_max_len - t_len)/2)
            q_max_len = q_max_len + ceil((t_max_len - t_len)/2)
        else:
            t_new_len = t_max_len
      
        if a_max_len > a_len:
            a_new_len = a_len 
            q_new_len = q_max_len + (a_max_len - a_len)
        elif q_max_len > q_len:
            a_new_len = a_max_len + (q_max_len - q_len)
            q_new_len = q_len
        else:
            a_new_len = a_max_len
            q_new_len = q_max_len
            
            
        if t_new_len+a_new_len+q_new_len+4 != max_sequence_length:
            raise ValueError("New sequence length should be %d, but is %d" 
                             % (max_sequence_length, (t_new_len+a_new_len+q_new_len+4)))
        
        t = t[:t_new_len]
        q = q[:q_new_len]
        a = a[:a_new_len]
    
    return t, q, a

def _convert_to_bert_inputs(title, question, answer, tokenizer, max_sequence_length):
    """Converts tokenized input to ids, masks and segments for BERT"""
    
    stoken = ["[CLS]"] + title + ["[SEP]"] + question + ["[SEP]"] + answer + ["[SEP]"]

    input_ids = _get_ids(stoken, tokenizer, max_sequence_length)
    input_masks = _get_masks(stoken, max_sequence_length)
    input_segments = _get_segments(stoken, max_sequence_length)

    return [input_ids, input_masks, input_segments]

def compute_input_arays(df, columns, tokenizer, max_sequence_length):
    input_ids, input_masks, input_segments = [], [], []
    for _, instance in tqdm(df[columns].iterrows()):
        t, q, a = instance.question_title, instance.question_body, instance.answer

        t, q, a = _trim_input(t, q, a, max_sequence_length)

        ids, masks, segments = _convert_to_bert_inputs(t, q, a, tokenizer, max_sequence_length)
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)
        
    return [np.asarray(input_ids, dtype=np.int32), 
            np.asarray(input_masks, dtype=np.int32), 
            np.asarray(input_segments, dtype=np.int32)]


def compute_output_arrays(df, columns):
    return np.asarray(df[columns])
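
Note that _get_segments above switches to segment 1 only after the second [SEP], which suits the three-part title/question/answer layout; for a two-part input it would label every token 0. A minimal sketch of a two-sentence variant follows, assuming a trim-the-longer-sequence truncation heuristic. The helper names truncate_pair and pair_masks_and_segments are hypothetical and not part of the quoted implementation.

```python
def truncate_pair(tokens_a, tokens_b, max_tokens):
    """Trim the longer sequence one token at a time until the pair fits."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > max_tokens:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    return tokens_a, tokens_b


def pair_masks_and_segments(tokens_a, tokens_b, max_seq_length):
    """Return (tokens, input_mask, segment_ids) padded to max_seq_length."""
    # Reserve three positions for [CLS] and the two [SEP] tokens.
    tokens_a, tokens_b = truncate_pair(tokens_a, tokens_b, max_seq_length - 3)
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    padding = max_seq_length - len(tokens)
    input_mask = [1] * len(tokens) + [0] * padding
    segment_ids = (
        [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1) + [0] * padding
    )
    return tokens, input_mask, segment_ids


# Short pair: fits without truncation and is padded to length 8.
tokens, input_mask, segment_ids = pair_masks_and_segments(
    "a b".split(), "x".split(), max_seq_length=8
)
```

The mask and segment lists always have length max_seq_length, so they can be stacked into arrays the same way compute_input_arays does.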
