Slurp

Reputation: 91

How can I load a function with source() in R and keep Hebrew characters?

I'm having problems with language encoding in R when loading a function from a file using source().

The function (defined below) takes a text file in Hebrew and searches for specific words. If the function is defined as part of an RStudio script, everything works as expected. But if I save the function to disk and load it using source(), the Hebrew search string is converted into what appears to be gibberish, and the search fails to find it. The search string is definitely present in the text file, and the file itself is loaded correctly in Hebrew.

I've tried wrapping the Hebrew strings in utf8::as_utf8(), for example utf8::as_utf8("מסכת"), but that has no effect.

Here's the function code & libraries:

library(stringr)
library(dplyr)
library(rvest)

test_fn <- function(x) {
    raw_text <- read_html(x)
    masechet <- raw_text %>% html_nodes("h2") %>%
        head(1) %>% html_text() %>%
        str_remove("מסכת") %>%
        str_remove("פרק א") %>% str_trim
    message(masechet)
}

To be clear: if that's part of an RStudio window, it all works fine. But if I load it like this:

assemble <- source("test_fn.r")
test_fn <- assemble$value

I get the following for the Hebrew text:

     str_remove("פרק ×") %>% str_trim

And if I tell source() to use UTF-8 encoding, I get an error and the file doesn't load at all:

assemble <- source("test_fn.r", encoding = "UTF-8")
Error in source("test_fn.r", encoding = "UTF-8") : 
  test_fn.r:5:20: unexpected INCOMPLETE_STRING
4:         head(1) %>% html_text() %>%
5:         str_remove("
                      ^
In addition: Warning message:
In readLines(file, warn = FALSE) :
  invalid input found on input connection 'test_fn.r'

I'm running Windows 10 in the UK. Sys.getlocale() returns the following:

"LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"

Any help would be appreciated.

Upvotes: 0

Views: 170

Answers (1)

user2554330

Reputation: 44907

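As @MrFlick noted, you are on Windows. R on Windows has known problems with UTF-8 strings, because Windows doesn't support them the way Unix-alikes do. Your Sys.getlocale() output shows the session running in code page 1252, which has no Hebrew characters.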
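A minimal sketch of how this kind of gibberish can arise, assuming the file's bytes are UTF-8 and are interpreted as code page 1252 (the actual encoding of test_fn.r isn't known from the question, so this only illustrates the mechanism). The variable names are made up for the example.

```r
# The Hebrew word from the question, written with escapes so this
# snippet is itself pure ASCII
word <- "\u05de\u05e1\u05db\u05ea"

# Its UTF-8 representation uses two bytes per Hebrew letter
utf8_bytes <- charToRaw(enc2utf8(word))
print(utf8_bytes)

# Assumption: the same bytes get interpreted as code page 1252,
# where every byte is treated as a separate Latin character
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")
print(misread)
nchar(misread)  # eight characters instead of four
```

Each Hebrew letter turns into a pair of unrelated Latin characters, so a search for the original word can no longer match.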

What I'd suggest you do is to make sure your code files are pure ASCII. To do this, you'll need to encode your Hebrew strings using \uXXXX escapes. It's a little painful to find those, but this function will do it for you:

# Print x as a double-quoted string literal made of \u escapes, one per code point
asEscapes <- function(x)
  cat(paste0('"', paste(sprintf("\\u%x", utf8ToInt(x)), collapse = ""), '"'))

For example,

asEscapes("מסכת")
# "\u5de\u5e1\u5db\u5ea"
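To check that an escaped literal really is the same string, it can be parsed back and compared with the original. The sketch below uses a hypothetical variant of the helper, as_escape_string, which is not part of the answer: it returns the literal instead of printing it, zero-pads to four hex digits, and switches to \U for code points above U+FFFF, which \u cannot express.

```r
# Hypothetical variant of asEscapes(): returns the escaped literal as a string
as_escape_string <- function(x) {
  cp <- utf8ToInt(x)  # one integer code point per character
  esc <- ifelse(cp > 0xFFFF,
                sprintf("\\U%08x", cp),  # needed beyond the 4-digit \u range
                sprintf("\\u%04x", cp))
  paste0('"', paste(esc, collapse = ""), '"')
}

word <- "\u05de\u05e1\u05db\u05ea"
lit <- as_escape_string(word)
cat(lit, "\n")
# "\u05de\u05e1\u05db\u05ea"

# Parsing the pure-ASCII literal gives back the original word
round_trip <- eval(parse(text = lit))
round_trip == word
```

The zero-padded form "\u05de" and the shorter "\u5de" denote the same character.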

So you'd use str_remove("\u5de\u5e1\u5db\u5ea") in place of str_remove("מסכת") and you should get the same results.
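As a rough sketch of what the pure-ASCII version could look like, here are both search strings from the question written as escapes and applied to a made-up heading. To keep the example runnable without extra packages, base R's sub(fixed = TRUE) and trimws() stand in for str_remove() and str_trim(). The helper name strip_heading and the sample heading are assumptions for illustration only.

```r
# Pure-ASCII patterns: the two Hebrew search strings written as \u escapes
masechet_word <- "\u05de\u05e1\u05db\u05ea"
perek_alef    <- "\u05e4\u05e8\u05e7 \u05d0"

# Hypothetical helper: strips both patterns from a heading.
# sub(fixed = TRUE) and trimws() stand in for str_remove() and str_trim().
strip_heading <- function(heading) {
  out <- sub(masechet_word, "", heading, fixed = TRUE)
  out <- sub(perek_alef, "", out, fixed = TRUE)
  trimws(out)
}

# Made-up sample heading, also written with escapes
tractate <- "\u05d1\u05e8\u05db\u05d5\u05ea"
heading <- paste(masechet_word, tractate, perek_alef)

result <- strip_heading(heading)
identical(utf8ToInt(result), utf8ToInt(tractate))  # only the middle word is left
```

Because the file contains no non-ASCII bytes, it should read the same whichever encoding source() assumes.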

Upvotes: 1
