I have some survey data containing Japanese characters. Some of the survey questions and answers (multiple choice) are given in both English and Japanese, e.g. "Very rarely かなりまれ". In this case, it is helpful to remove the duplicate Japanese text. How does one accomplish this? I only want to remove Japanese, not any other special characters.
You can use this to take out the Hiragana and Katakana (where str is the input string):
str.replace(/[\u30a0-\u30ff\u3040-\u309f]/g, '')
See also: JavaScript to replace Chinese characters
The simplest approach is to keep only ASCII characters. This can be done by replacing non-ASCII characters with empty strings (e.g. str_replace_all("æøå かな", "[^\\x20-\\x7E]", "")) and then removing any resulting excess whitespace. However, if one wants to keep special symbols in general, this approach does not work. In that case one may want to remove only the Japanese (including Chinese kanji) symbols. This can be done by Unicode block range matching. I found the relevant Japanese blocks here: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml, but Wikipedia lists them as well, e.g. https://en.wikipedia.org/wiki/Katakana_(Unicode_block).
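For illustration, here is a minimal sketch of the keep-only-ASCII route with stringr, using the example string from the question:
library(stringr)

x = "Very rarely かなりまれ"
#drop everything outside the printable ASCII range, then trim the leftover space
str_trim(str_replace_all(x, "[^\\x20-\\x7E]", ""))
#> [1] "Very rarely"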
Here's a ready-made function (requires tidyverse and assertthat, plus magrittr for the equals() used in the tests):
library(tidyverse)
library(magrittr)
library(assertthat)

str_rm_jap = function(x) {
  #we replace the Japanese blocks with nothing, and clean any double whitespace from this
  #reference at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
  x %>%
    #Japanese-style punctuation
    str_replace_all("[\u3000-\u303F]", "") %>%
    #katakana
    str_replace_all("[\u30A0-\u30FF]", "") %>%
    #hiragana
    str_replace_all("[\u3040-\u309F]", "") %>%
    #kanji
    str_replace_all("[\u4E00-\u9FAF]", "") %>%
    #remove excess whitespace
    str_replace_all(" +", " ") %>%
    str_trim()
}
#tests
assert_that(
  #positive tests
  "Very rarely かなりまれ" %>% str_rm_jap() %>% equals("Very rarely"),
  "Comments ノートとコメント" %>% str_rm_jap() %>% equals("Comments"),
  #negative tests
  "Danish ok! ÆØÅ" %>% str_rm_jap() %>% equals("Danish ok! ÆØÅ")
)
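Since str_replace_all() is vectorized, str_rm_jap() also works on whole columns. A quick sketch on a made-up survey data frame (the column name response is hypothetical):
survey = tibble(response = c("Very rarely かなりまれ", "Comments ノートとコメント"))
survey = survey %>% mutate(response = str_rm_jap(response))
#response is now "Very rarely" and "Comments"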