CoderGuy123

Reputation: 6649

How do I remove Japanese characters?

I have survey data that contains Japanese characters. Some of the survey questions and (multiple-choice) answers are given in both English and Japanese, e.g. "Very rarely かなりまれ". In this case, it is helpful to remove the duplicate Japanese text. How does one accomplish this? I only want to remove Japanese, not any other special characters.

Upvotes: 1

Views: 2539

Answers (2)

Evandro Coan

Reputation: 9418

You can use this to strip Hiragana and Katakana (it does not cover Kanji):

replace(/[\u30a0-\u30ff\u3040-\u309f]/g, '')
  1. https://regex101.com/r/O5mfPu/1
  2. https://en.wikipedia.org/wiki/Katakana_(Unicode_block)
  3. https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)

See also: JavaScript to replace Chinese characters

Upvotes: 0

CoderGuy123

Reputation: 6649

The simplest approach is to keep only ASCII characters: replace everything non-ASCII with the empty string (e.g. str_replace_all("æøå かな", "[^\\x00-\\x7F]", "")) and trim any leftover whitespace. However, this also strips every other non-ASCII character (æøå in the example), so it does not work if you want to keep special symbols in general. In that case, remove only the Japanese characters (including Kanji, which share code points with Chinese characters) by matching their Unicode block ranges. I found the relevant blocks at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml, but Wikipedia lists them as well, e.g. https://en.wikipedia.org/wiki/Katakana_(Unicode_block).
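For reference, the ASCII-only approach could be sketched like this. The helper name str_keep_ascii is made up for illustration, and str_squish() is used as one possible way to tidy the leftover whitespace:

```r
library(stringr)

# Hypothetical helper: keep only ASCII (code points 0x00-0x7F),
# then trim and collapse whatever whitespace is left behind.
str_keep_ascii <- function(x) {
  x <- str_replace_all(x, "[^\\x00-\\x7F]", "")
  str_squish(x)
}

str_keep_ascii("Very rarely かなりまれ")  # "Very rarely"
str_keep_ascii("Danish ok! ÆØÅ")          # "Danish ok!" (the Danish letters are lost too)
```

The second call shows why this is too blunt when other non-ASCII symbols should survive.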

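Here's a ready-made function (requires tidyverse for stringr and the pipe; the tests below also need assertthat, plus magrittr for equals()):

library(tidyverse)
library(magrittr)
library(assertthat)

str_rm_jap = function(x) {
  #replace the Japanese Unicode blocks with nothing, then clean up any double whitespace left behind
  #block reference: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml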
  x %>% 
    #japanese style punctuation
    str_replace_all("[\u3000-\u303F]", "") %>% 
    #katakana
    str_replace_all("[\u30A0-\u30FF]", "") %>% 
    #hiragana
    str_replace_all("[\u3040-\u309F]", "") %>% 
    #kanji
    str_replace_all("[\u4E00-\u9FAF]", "") %>% 
    #remove excess whitespace
    str_replace_all("  +", " ") %>% 
    str_trim()
}

#tests
assert_that(
  #positive tests
  "Very rarely かなりまれ" %>% str_rm_jap() %>% equals("Very rarely"),
  "Comments ノートとコメント" %>% str_rm_jap() %>% equals("Comments"),

  #negative tests
  "Danish ok! ÆØÅ" %>% str_rm_jap() %>% equals("Danish ok! ÆØÅ")
)
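Since str_replace_all() is vectorised, the same idea should also work on a whole vector or column of survey labels at once. A minimal sketch, assuming the four block ranges are merged into one character class (written as regex-level \u escapes); the names jp_blocks and strip_japanese are hypothetical:

```r
library(stringr)

# One character class covering the same four blocks as above:
# CJK punctuation, hiragana, katakana, kanji.
jp_blocks <- "[\\u3000-\\u303F\\u3040-\\u309F\\u30A0-\\u30FF\\u4E00-\\u9FAF]"

# Hypothetical helper: drop the Japanese characters, then tidy whitespace.
strip_japanese <- function(x) {
  str_squish(str_remove_all(x, jp_blocks))
}

labels <- c("Very rarely かなりまれ", "Comments ノートとコメント", "Danish ok! ÆØÅ")
strip_japanese(labels)
```

For a data frame, the same call can be applied to a column, e.g. inside dplyr::mutate().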

Upvotes: 2
