Yaman Afadar

Reputation: 53

BERT tokenizing error: ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

I am using BERT for a text classification task. When I try to tokenize a single data sample with the code:

encoded_sent = tokenizer.encode(
    sentences[7],
    add_special_tokens=True)

it works fine, but whenever I try to tokenize the whole dataset with the code:

# For every sentence...
for sent in sentences:
    encoded_sent = tokenizer.encode(
        sent,
        add_special_tokens=True)

it gives me the error:

"ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."

I tried it on English data that someone else had tokenized successfully, and I get the same error. This is how I load my data:

import pandas as pd

df = pd.read_csv("/content/DATA.csv", header=0, dtype=str)
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'label'
df.columns = [DATA_COLUMN, LABEL_COLUMN]

df["sentence"].head()

and this is how I load the tokenizer:

# Load the BERT tokenizer.
from transformers import AutoTokenizer

print('Loading BERT tokenizer...')
tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabert')

A sample of my data:

Original: مساعد نائب رئيس المنزل: لم نر حتى رسالة كومي حتى غردها جيسون تشافيتز

Tokenized: ['مساعد', 'نائب', 'رئيس', 'ال', '##منزل', ':', 'لم', 'نر', 'حتى', 'رسال', '##ة', 'كومي', 'حتى', 'غرد', '##ها', 'جيسون', 'تشافي', '##ت', '##ز']

Any suggestions, please?

Upvotes: 1

Views: 8558

Answers (1)

rulcaster

Reputation: 29

It seems like your data contains NaN values. To get around this issue, you have to either eliminate the NaN values or convert all of the data to strings (a local workaround).
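If you go with the first option (removing the missing rows), here is a minimal sketch using pandas dropna, reusing the file path and column names from your question; note that even with dtype=str, missing cells are still read in as NaN:

import pandas as pd

# Read the data as in the question, then drop rows with a missing sentence.
df = pd.read_csv("/content/DATA.csv", header=0, dtype=str)
df.columns = ['sentence', 'label']
df = df.dropna(subset=['sentence'])

# Rebuild the list of sentences from the cleaned DataFrame.
sentences = df['sentence'].tolist()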

For the local workaround (casting each sentence to a string), try:

encoded_sent = tokenizer.encode(
    str(sent),
    add_special_tokens=True)
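Keep in mind that str() turns a missing value (a float NaN) into the literal string 'nan', which the tokenizer will happily encode as if it were a real word, so dropping or cleaning those rows is usually the safer fix.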

If you're sure that the dataset doesn't contain NaN values, you can keep your original code; otherwise, to find out whether (and where) your dataset contains NaN values, you can use:

for sent in sentences:
    print(sent)
    encoded_sent = tokenizer.encode(sent, add_special_tokens=True)
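Since the sentences come from a pandas DataFrame, you can also locate the offending rows directly instead of printing every sentence; a quick check, assuming the df loaded in your question:

# Count the missing sentences and show the rows that contain them.
print(df['sentence'].isna().sum())
print(df[df['sentence'].isna()])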

Upvotes: 2
