Anony

Reputation: 109

HuggingFace tokenizer for Japanese

I recently tested the code below, based on this source: https://github.com/cl-tohoku/bert-japanese/blob/master/masked_lm_example.ipynb

import torch
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
from transformers.modeling_bert import BertForMaskedLM

# Load the pretrained Japanese BERT tokenizer and masked language model
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

# Encode a sentence containing the [MASK] token
input_ids = tokenizer.encode(f'''
    青葉山で{tokenizer.mask_token}の研究をしています。
''', return_tensors='pt')

When I try to encode it, I receive an error like this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-f8582275f4db> in <module>
      1 input_ids = tokenizer.encode(f'''
      2     青葉山で{tokenizer.mask_token}の研究をしています。
----> 3 ''', return_tensors='pt')

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
   1428             stride=stride,
   1429             return_tensors=return_tensors,
-> 1430             **kwargs,
   1431         )
   1432 

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   1740             return_length=return_length,
   1741             verbose=verbose,
-> 1742             **kwargs,
   1743         )
   1744 

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    452             )
    453 
--> 454         first_ids = get_input_ids(text)
    455         second_ids = get_input_ids(text_pair) if text_pair is not None else None
    456 

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
    423         def get_input_ids(text):
    424             if isinstance(text, str):
--> 425                 tokens = self.tokenize(text, **kwargs)
    426                 return self.convert_tokens_to_ids(tokens)
    427             elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    362 
    363         no_split_token = self.unique_no_split_tokens
--> 364         tokenized_text = split_on_tokens(no_split_token, text)
    365         return tokenized_text
    366 

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
    356                     (
    357                         self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
--> 358                         for token in tokenized_text
    359                     )
    360                 )

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_utils.py in <genexpr>(.0)
    356                     (
    357                         self._tokenize(token) if token not in self.unique_no_split_tokens else [token]
--> 358                         for token in tokenized_text
    359                     )
    360                 )

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_bert_japanese.py in _tokenize(self, text)
    153     def _tokenize(self, text):
    154         if self.do_word_tokenize:
--> 155             tokens = self.word_tokenizer.tokenize(text, never_split=self.all_special_tokens)
    156         else:
    157             tokens = [text]

~/.pyenv/versions/3.7.0/envs/personal/lib/python3.7/site-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
    205                 break
    206 
--> 207             token, _ = line.split("\t")
    208             token_start = text.index(token, cursor)
    209             token_end = token_start + len(token)

ValueError: too many values to unpack (expected 2)

Has anyone experienced this before? I tried many different approaches and referred to many posts, but they all use the same methods and give no explanation. I just wanted to test multiple languages; the other languages seem to work fine, but Japanese does not, and I don't know why.

Upvotes: 2

Views: 2155

Answers (2)

polm23

Reputation: 15633

NOTE: Shortly after this question was asked, I released a version of IPADic that works with the latest versions of mecab-python3. You should now be able to fix things by installing transformers[ja], which pulls in the main dictionaries used with HuggingFace models.
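
For reference, that route looks something like this (a minimal sketch; it assumes a transformers release recent enough to ship the ja extra, which installs the MeCab bindings and dictionary packages):

pip install "transformers[ja]"

After that, the snippet from the question should tokenize without raising the ValueError.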


I'm the mecab-python3 maintainer. Transformers relies on the dictionary that was bundled with mecab-python3 prior to version 1.0, and that bundled dictionary has been removed because it's old. I will be adding it back as an option in a release soon, but in the meantime you can install an old version.

The command originally posted by vivasra doesn't work because it specifies a nonexistent version of a different package (notice there's no "3" in the package name). You can use this:

pip install mecab-python3==0.996.5
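
As a quick sanity check (a minimal sketch reusing the tokenizer from the question), something like this should now run without the ValueError:

from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

# Same tokenizer as in the question; with mecab-python3 0.996.5 the bundled
# IPADic dictionary is available, so word tokenization should no longer fail.
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
print(tokenizer.tokenize(f'青葉山で{tokenizer.mask_token}の研究をしています。'))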

If you still have trouble, please open an issue on GitHub.

Upvotes: 2

vivasra

Reputation: 23

A quick check produces no errors for me, so perhaps there are version issues in your case?

From the traceback, the error occurs in BertJapaneseTokenizer, so the version of the underlying tokenizer (MeCab?) that you have may be incompatible with your environment.

The mecab-python3 in my environment:

!pip list | grep mecab
#mecab-python3  0.996.5

You could create a new environment or try the command below (or some other available version):

!pip install 'mecab-python3==0.996.5' --force-reinstall

Edit: fixed the package name in the install command (thanks, @polm23)

Upvotes: 0
