Harshit Kumar

Reputation: 109

Does the BERT model need pre-processed text?

Do BERT models need pre-processed text (like removing special characters, stop words, etc.), or can I pass my text as-is to BERT models (HuggingFace libraries)?

Note: follow-up question to: String cleaning/preprocessing for BERT

Upvotes: 0

Views: 2802

Answers (3)

Anubhav Chhabra

Reputation: 31

In my opinion, pre-processing is not required either for training or for inference with BERT. I can explain it with a few examples:

  1. To continue from @Arthuro's answer: stop words are actually valuable, and BERT internally maps relations between different words.
  2. We should not even clean out things like hyperlinks or Twitter handle mentions (e.g. @someones_twitter_handle). The reason is subword tokenization! BERT uses a subword tokenization scheme called WordPiece. The WordPiece tokenizer breaks unknown words into known subword pieces, so nothing ends up as an unusable unknown token. HuggingFace has a really nice article that explains how this works; there is also a quick demonstration sketched below.

Upvotes: 1

Arthuro

Reputation: 33

Cleaning the input text for transformer models is not required. Removing stop words (which are considered noise in conventional text representations like bag-of-words or tf-idf) can, and probably will, worsen the predictions of your BERT model.

Since BERT makes use of the self-attention mechanism, these 'stop words' are valuable information for BERT.

Consider the following example: Python's NLTK library treats words like 'her' or 'him' as stop words. Say we want to process the text 'I told her about the best restaurants in town'. Removing stop words with NLTK gives us 'I told best restaurants town'. As you can see, a lot of information is discarded. Sure, we could still train a classic ML classifier on the result (e.g. for topic classification, here food), but BERT captures much more semantic information from the surroundings of each word.

Upvotes: 1

Green 绿色

Reputation: 2916

You need to tokenize your text first. The BertTokenizer class handles everything you need, from raw text to token IDs. See this:

from transformers import BertTokenizer, BertModel
import torch

# Load the pre-trained tokenizer and model weights.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# The tokenizer handles lowercasing, punctuation splitting, WordPiece
# subwords, special tokens ([CLS]/[SEP]) and conversion to tensors.
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# One contextual embedding per input token.
last_hidden_states = outputs.last_hidden_state

Upvotes: 0
