Reputation: 109
I recently tested the code below, based on this example: https://github.com/cl-tohoku/bert-japanese/blob/master/masked_lm_example.ipynb
import torch
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
from transformers.modeling_bert import BertForMaskedLM
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
input_ids = tokenizer.encode(f'''
青葉山で{tokenizer.mask_token}の研究をしています。
''', return_tensors='pt')
When I try to encode it, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-f8582275f4db> in <module>
1 input_ids = tokenizer.encode(f'''
2 青葉山で{tokenizer.mask_token}の研究をしています。
----> 3 ''', return_tensors='pt')
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
1428 stride=stride,
1429 return_tensors=return_tensors,
-> 1430 **kwargs,
1431 )
1432
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
1740 return_length=return_length,
1741 verbose=verbose,
-> 1742 **kwargs,
1743 )
1744
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
452 )
453
--> 454 first_ids = get_input_ids(text)
455 second_ids = get_input_ids(text_pair) if text_pair is not None else None
456
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
423 def get_input_ids(text):
424 if isinstance(text, str):
--> 425 tokens = self.tokenize(text, **kwargs)
426 return self.convert_tokens_to_ids(tokens)
427 elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
362
363 no_split_token = self.unique_no_split_tokens
--> 364 tokenized_text = split_on_tokens(no_split_token, text)
365 return tokenized_text
366
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
356 (
357 self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
--> 358 for token in tokenized_text
359 )
360 )
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in <genexpr>(.0)
356 (
357 self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
--> 358 for token in tokenized_text
359 )
360 )
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_bert_japanese.py in _tokenize(self, text)
153 def _tokenize(self, text):
154 if self.do_word_tokenize:
--> 155 tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)
156 else:
157 tokens = [text]
~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
205 break
206
--> 207 token, _ = line.split("\t")
208 token_start = text.index(token, cursor)
209 token_end = token_start + len(token)
ValueError: too many values to unpack (expected 2)
Has anyone experienced this before? I have tried many different approaches and referred to many posts, but they all use the same methods and offer no explanation. I just want to test multiple languages; the other languages seem to work fine, but Japanese doesn't, and I don't know why.
Upvotes: 2
Views: 2155
Reputation: 15633
NOTE: Shortly after this question was asked, I released a version of IPADic that works with the latest versions of mecab-python3. You should be able to fix things by installing transformers[ja], which will install the main dictionaries used with HuggingFace models.
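For example, something like this should pull in the Japanese dictionaries (the quotes just keep the shell from interpreting the brackets):
pip install 'transformers[ja]'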
I'm the mecab-python3 maintainer. Transformers relies on the dictionary that was bundled with mecab-python3 before 1.0; it was removed in 1.0 because it's old. I will be adding it back as an option in a release soon, but in the meantime you can install an old version.
The command posted by vivasra doesn't work because it specifies a version of a different package (notice there is no "3" in the package name), and that version doesn't exist. You can use this:
pip install mecab-python3==0.996.5
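After either fix, re-running the snippet from the question should tokenize without the ValueError. Here is a minimal sketch to check, reusing the model name from the question (the masked-token prediction at the end is just an illustration, not part of the original notebook):
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

# This is the call that raised ValueError before; it should now return token ids.
text = f'青葉山で{tokenizer.mask_token}の研究をしています。'
input_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))

# Optional: predict the masked token to confirm the model runs end to end.
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(input_ids)[0]
predicted_id = logits[0, masked_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id]))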
If you still have trouble, please open an issue on GitHub.
Upvotes: 2
Reputation: 23
A quick check gives no errors for me, so maybe there is a version issue in your case?
From the traceback, the error occurs in BertJapaneseTokenizer, so possibly the version of the tokenizer backend (MeCab?) that you have is incompatible with your environment.
The mecab-python3 version in my environment:
!pip list | grep mecab
#mecab-python3 0.996.5
Maybe you could create a new environment or try the command below (or some other available version):
!pip install 'mecab-python3==0.996.5' --force-reinstall
Edit: fixed the environment setting (thanks, @polm23)
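If the reinstall worked, a quick check like this (just a sketch, reusing the model name from the question) should print tokens instead of raising the ValueError:
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
print(tokenizer.tokenize('青葉山で研究をしています。'))  # should print word/subword tokens, not raise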
Upvotes: 0