Crops

Reputation: 5154

Mixed kana and kanji romanization to romaji in R

I have a large character vector of Japanese words (mixed kanji and kana) that needs to be romanized (converted to romaji).

However, the available functions (zipangu::str_conv_romanhira() and audubon::strj_romanize()) do not give the desired results.

For example, for 北海道 (Hokkaido), zipangu::str_conv_romanhira() converts it to Chinese pinyin, and audubon::strj_romanize() converts only kana characters, so it returns an empty string here.

How can I convert such mixed kana and kanji text to romaji?

library(zipangu)
library(stringi)
library(audubon)


str_conv_romanhira("北海道", "roman")
#> [1] "běi hǎi dào"

stri_trans_general("北海道", "Any-Latin")
#> [1] "běi hǎi dào"

strj_romanize("北海道")
#> [1] ""

Upvotes: 1

Views: 469

Answers (1)

lroha

Reputation: 34441

As far as I can see, no R package currently on CRAN transliterates Japanese kanji to romaji. It is easy enough, however, to call the Python module pykakasi from R via reticulate:

library(reticulate)

py_install("pykakasi")  # Only need to install once

# Make module available in R
pykakasi <- import("pykakasi")

# Alias the convert function for convenience
convert <- pykakasi$kakasi()$convert

convert("北海道")

[[1]]
[[1]]$orig
[1] "北海道"

[[1]]$hira
[1] "ほっかいどう"

[[1]]$kana
[1] "ホッカイドウ"

[[1]]$hepburn
[1] "hokkaidou"

[[1]]$kunrei
[1] "hokkaidou"

[[1]]$passport
[1] "hokkaidou"

# Function to extract romaji and collapse
to_romaji <- function(txt) {
  paste(sapply(convert(txt), `[[`, "hepburn"), collapse = " ")
}

# Test on some longer text
lapply(c("北海道", "石の上にも三年", "豚に真珠"), to_romaji)

[[1]]
[1] "hokkaidou"

[[2]]
[1] "ishi no ueni mo sannen"

[[3]]
[1] "buta ni shinju"
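
Since the question mentions a large character vector, it may be more convenient to get a plain character vector back rather than a list. Below is a rough sketch of one way to do that; romanize_vec() and fake_convert() are hypothetical names introduced only for illustration, and mapping NA or empty input to NA is my own assumption about what is wanted. The stand-in converter just mimics the list-of-tokens shape shown above so the snippet runs without Python; in practice you would pass the convert alias defined earlier.

```r
# Hypothetical helper: romanize every element of a character vector.
# `converter` is any function returning a list of tokens, where each token
# is a named list with fields such as "hepburn" (the shape printed above).
romanize_vec <- function(x, converter, field = "hepburn", sep = " ") {
  vapply(x, function(txt) {
    # Assumption: missing or empty input is returned as NA
    if (is.na(txt) || !nzchar(txt)) return(NA_character_)
    tokens <- converter(txt)
    # Pull the chosen field out of each token and join the pieces
    paste(vapply(tokens, function(tok) tok[[field]], character(1)),
          collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in converter (NOT pykakasi): ignores its input and returns one
# fixed token, purely so the sketch can run without a Python installation.
fake_convert <- function(txt) {
  list(list(orig = txt, hepburn = "hokkaidou", kunrei = "hokkaidou"))
}

romanize_vec(c("北海道", NA, ""), fake_convert)
```

With the real converter this would be called as romanize_vec(words, convert), and field = "kunrei" or field = "passport" should select the other romanization fields listed in the output above.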

Upvotes: 2
