I have some survey data containing Japanese characters. Some of the survey questions and answers (multiple choice) are given in both English and Japanese, e.g. "Very rarely かなりまれ". In this case, it is helpful to remove the duplicate Japanese text. How does one accomplish this? I only want to remove Japanese, not any other special characters.
You can use this to take out the Hiragana and Katakana (where str is the input string):
str.replace(/[\u30a0-\u30ff\u3040-\u309f]/g, '')
See also: JavaScript to replace Chinese characters
The simplest approach is to keep only ASCII characters. This can be done by replacing non-ASCII characters with empty strings (e.g. str_replace_all("æøå かな", "[^\\x20-\\x7E]", "")) and then removing any resulting excess whitespace. However, if one wants to keep special symbols in general, this approach does not work. In that case one may want to remove only the Japanese (including Chinese kanji) symbols. This can be done by Unicode block range matching. I found the relevant Japanese blocks here: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml, but Wikipedia lists them as well, e.g. https://en.wikipedia.org/wiki/Katakana_(Unicode_block).
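For illustration, here is a minimal sketch of the keep-only-ASCII route with stringr, using the example string from the question:
library(stringr)

x = "Very rarely かなりまれ"
#drop everything outside the printable ASCII range, then trim the leftover space
str_trim(str_replace_all(x, "[^\\x20-\\x7E]", ""))
#> [1] "Very rarely"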
Here's a ready-made function (requires tidyverse and assertthat, plus magrittr for the equals() used in the tests):
library(tidyverse)
library(magrittr)
library(assertthat)

str_rm_jap = function(x) {
  #we replace the Japanese blocks with nothing, and clean any double whitespace from this
  #reference at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
  x %>%
    #Japanese-style punctuation
    str_replace_all("[\u3000-\u303F]", "") %>%
    #katakana
    str_replace_all("[\u30A0-\u30FF]", "") %>%
    #hiragana
    str_replace_all("[\u3040-\u309F]", "") %>%
    #kanji
    str_replace_all("[\u4E00-\u9FAF]", "") %>%
    #remove excess whitespace
    str_replace_all(" +", " ") %>%
    str_trim()
}
#tests
assert_that(
  #positive tests
  "Very rarely かなりまれ" %>% str_rm_jap() %>% equals("Very rarely"),
  "Comments ノートとコメント" %>% str_rm_jap() %>% equals("Comments"),
  #negative tests
  "Danish ok! ÆØÅ" %>% str_rm_jap() %>% equals("Danish ok! ÆØÅ")
)
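Since str_replace_all() is vectorized, str_rm_jap() also works on whole columns. A quick sketch on a made-up survey data frame (the column name response is hypothetical):
survey = tibble(response = c("Very rarely かなりまれ", "Comments ノートとコメント"))
survey = survey %>% mutate(response = str_rm_jap(response))
#response is now "Very rarely" and "Comments"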