Reputation: 53
I am using BERT for a text classification task. When I try to tokenize one data sample using the code:
encoded_sent = tokenizer.encode(
    sentences[7],
    add_special_tokens=True
)
it goes well, but whenever I try to tokenize the whole dataset using the code:
# For every sentence...
for sent in sentences:
    encoded_sent = tokenizer.encode(
        sent,
        add_special_tokens=True
    )
it gives me the error:
"ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
I tried with English data that was successfully tokenized by someone else and I get the same error. This is how I load my data:
import pandas as pd

df = pd.read_csv("/content/DATA.csv", header=0, dtype=str)
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'label'
df.columns = [DATA_COLUMN, LABEL_COLUMN]
df["sentence"].head()
and this is how I load the tokenizer:
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabert')
A sample of my data:
Original: مساعد نائب رئيس المنزل: لم نر حتى رسالة كومي حتى غردها جيسون تشافيتز
Tokenized: ['مساعد', 'نائب', 'رئيس', 'ال', '##منزل', ':', 'لم', 'نر', 'حتى', 'رسال', '##ة', 'كومي', 'حتى', 'غرد', '##ها', 'جيسون', 'تشافي', '##ت', '##ز']
Any suggestions, please?
Upvotes: 1
Views: 8558
Reputation: 29
It seems like your data contains NaN values. To get past this issue, you either have to eliminate the NaN values or convert every entry to a string (a local fix).
Try using:
encoded_sent = tokenizer.encode(
    str(sent),
    add_special_tokens=True
)
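Alternatively, if you would rather eliminate the NaN rows at the source (the global fix mentioned above), you can drop them from the DataFrame before extracting the sentences. A minimal sketch, assuming the file path and column names from the question:

import pandas as pd

df = pd.read_csv("/content/DATA.csv", header=0, dtype=str)
df.columns = ['sentence', 'label']

# Drop rows with a missing sentence, then rebuild the index
df = df.dropna(subset=['sentence']).reset_index(drop=True)

sentences = df['sentence'].values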
If you're sure the dataset doesn't contain NaN values, you can use that solution as-is; otherwise, to detect whether your dataset contains NaN values, you might use:
for sent in sentences:
    print(sent)
    encoded_sent = tokenizer.encode(sent, add_special_tokens=True)
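Printing each sentence works, but on a large dataset it may be quicker to let pandas point at the offending rows directly. A small sketch, assuming the DataFrame `df` loaded as in the question:

# Rows whose 'sentence' column is missing (NaN)
bad_rows = df[df['sentence'].isna()]
print(bad_rows)

# Or just count how many there are
print(df['sentence'].isna().sum())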
Upvotes: 2