Reputation: 5154
I have a large character vector of Japanese words (mixed kanji and kana) that needs to be romanized (converted to romaji).
However, the available functions, zipangu::str_conv_romanhira() and audubon::strj_romanize(), do not give the desired results.
For example, for 北海道 (Hokkaido), zipangu::str_conv_romanhira() converts it to Chinese pinyin, and audubon::strj_romanize() converts only kana characters.
How can I convert such mixed kana and kanji text to romaji?
library(zipangu)
library(stringi)
library(audubon)
str_conv_romanhira("北海道", "roman")
#> [1] "běi hǎi dào"
stri_trans_general("北海道", "Any-Latin")
#> [1] "běi hǎi dào"
strj_romanize("北海道")
#> [1] ""
Upvotes: 1
Views: 469
Reputation: 34441
As far as I can see, there aren't any R packages (at least none currently on CRAN) that transliterate Japanese kanji to romaji. It's easy enough, however, to call the Python module pykakasi from R via reticulate:
library(reticulate)
py_install("pykakasi") # Only need to install once
# Make module available in R
pykakasi <- import("pykakasi")
# Alias the convert function for convenience
convert <- pykakasi$kakasi()$convert
convert("北海道")
#> [[1]]
#> [[1]]$orig
#> [1] "北海道"
#> [[1]]$hira
#> [1] "ほっかいどう"
#> [[1]]$kana
#> [1] "ホッカイドウ"
#> [[1]]$hepburn
#> [1] "hokkaidou"
#> [[1]]$kunrei
#> [1] "hokkaidou"
#> [[1]]$passport
#> [1] "hokkaidou"
# Function to extract the Hepburn romaji of each segment and collapse with spaces
to_romaji <- function(txt) {
  paste(sapply(convert(txt), `[[`, "hepburn"), collapse = " ")
}
# Test on some longer text
lapply(c("北海道", "石の上にも三年", "豚に真珠"), to_romaji)
#> [[1]]
#> [1] "hokkaidou"
#> [[2]]
#> [1] "ishi no ueni mo sannen"
#> [[3]]
#> [1] "buta ni shinju"
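Since the question mentions a large character vector, it may be more convenient to get a character vector back rather than a list. The following is only a minimal sketch and not part of the original answer: romanize_vector is a hypothetical helper name, it takes the converter as an argument (for instance the convert alias defined above), and it assumes each call returns a list of segments with a "hepburn" element, as in the output shown earlier. It converts each distinct string once and passes NA through unchanged.

```r
# Hypothetical helper (an illustration, not from the original answer):
# romanize a whole character vector, converting each distinct string once.
# `convert_fn` is assumed to behave like the `convert` alias above: it takes
# one string and returns a list of segments, each with a `field` element.
romanize_vector <- function(x, convert_fn, field = "hepburn", sep = " ") {
  one <- function(txt) {
    # Pass missing values through without calling the converter
    if (is.na(txt)) return(NA_character_)
    parts <- vapply(convert_fn(txt), function(seg) seg[[field]], character(1))
    paste(parts, collapse = sep)
  }
  uniq <- unique(x)
  lookup <- vapply(uniq, one, character(1), USE.NAMES = FALSE)
  # Map the converted unique values back to the original positions
  lookup[match(x, uniq)]
}
```

With the pykakasi setup above, this could be called as romanize_vector(c("北海道", "豚に真珠"), convert). Deduplicating first avoids repeated round trips into Python when the vector contains many repeated words.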
Upvotes: 2