Reputation: 109
Do BERT models need pre-processed text (e.g. removing special characters, stop words, etc.), or can I pass my text to BERT models as it is? (Hugging Face libraries.)
Note: follow-up question to: String cleaning/preprocessing for BERT
Upvotes: 0
Views: 2802
Reputation: 31
In my experience, pre-processing is not required either for training or for inference with BERT: the model was pre-trained on raw text, and its tokenizer handles casing and punctuation itself. A quick example illustrates this.
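Here is a minimal sketch (assuming the Hugging Face transformers library, as in the question) showing that the tokenizer copes with raw, uncleaned text on its own:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# raw text with punctuation and a special character, passed in as-is
text = "Don't strip punctuation or symbols like @ before BERT!"

# punctuation and symbols simply become their own (sub)word tokens
# instead of causing errors, so no manual cleaning is needed
print(tokenizer.tokenize(text))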
Upvotes: 1
Reputation: 33
Cleaning the input text for transformer models is not required. Removing stop words (which are considered noise in conventional text representations like bag-of-words or tf-idf) can, and probably will, worsen the predictions of your BERT model.
Since BERT makes use of the self-attention mechanism, these 'stop words' are valuable information for it.
Consider the following example: Python's NLTK library treats words like 'her' or 'him' as stop words. Say we want to process a text like 'I told her about the best restaurants in town'. Removing stop words with NLTK would give us 'I told best restaurants town'. As you can see, a lot of information is discarded. Sure, we could still train a classic ML classifier (e.g. for topic classification, here: food) on the filtered text, but BERT captures much more semantic information from the surroundings of each word. The NLTK filtering step is sketched below.
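A minimal sketch of that stop-word filtering, assuming NLTK is installed (the one-time stop-word corpus download is included):

import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpus
from nltk.corpus import stopwords

sentence = "I told her about the best restaurants in town"
stop_words = set(stopwords.words('english'))

# case-sensitive filter: 'her', 'about', 'the' and 'in' are dropped
filtered = [w for w in sentence.split() if w not in stop_words]
print(' '.join(filtered))  # -> 'I told best restaurants town'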
Upvotes: 1
Reputation: 2916
You need to tokenize your text first. The BertTokenizer class handles everything you need to get from raw text to tokens. See this:
from transformers import BertTokenizer, BertModel
import torch

# load the pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# tokenize raw text and return PyTorch tensors
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# forward pass; the contextual embeddings are in last_hidden_state
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
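For completeness, you can inspect what the tokenizer produced; this short continuation of the snippet above shows that special tokens and punctuation are handled automatically:

# the encoding holds input_ids, token_type_ids and attention_mask
print(inputs.keys())

# map the ids back to tokens: [CLS] and [SEP] are inserted for you,
# and the comma survives as a token of its own
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()))
# ['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']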
Upvotes: 0