Grzegorz Piotr

Reputation: 31

R character strings are limited to 2^31-1 bytes - how can I work around this?

I am working on R code that reads the content of a file through an API call. The content is base64 encoded and the file itself is over 2 GB.

I have tried a few approaches so far: downloading the file into memory and writing it to disk. In both cases, when I try to decode the file it fails with:

Error in readLines: R character strings are limited to 2^31-1 bytes

Has anyone faced this in the past and knows how to work around it? At first I tried the call with httr, then switched to httr2:

req <- request(test_datapull_API) |>
    req_headers(
        'Content-Type' = 'application/json',
        'Cookie' = sprintf('token=%s; username=%s', token, input$user_id)
    ) |>
    req_body_json(call_body)

tmp <- tempfile()

req_perform(
    req,
    path = tmp
)

stringvalue <- readLines(tmp) # this is where it fails with the mentioned error

# decode the base64 string to binary (raw) data
b64_result <- base64_decode(stringvalue)

I am running this in Docker with more than 50 GB available and no per-container limit, so I don't believe this is an R memory issue.
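A quick check (just a sketch, with tmp being the downloaded file from the code above) that this is the per-string limit rather than memory: reading the same bytes into a raw vector succeeds, since raw vectors allow long lengths.

# Sketch: raw vectors allow long lengths (> 2^31-1 elements),
# so reading the bytes directly does not hit the string limit
sz <- file.info(tmp)$size
raw_payload <- readBin(tmp, what = "raw", n = sz)
length(raw_payload) # works even though readLines() on the same file fails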

Anyone?

I tried the download and decoding with the httr, httr2, base64 and openssl libraries.

Upvotes: 2

Views: 617

Answers (2)

Grzegorz Piotr

Reputation: 31

This is the solution that worked for me:

# Function to decode the base64 payload in chunks, avoiding the
# 2^31-1 byte limit on individual character strings
decode_data_in_chunks <- function(tmp, chunk_size = 32768) {

    ftemp <- tempfile(fileext = ".zip")

    # Raw vectors are not subject to the string limit, so the whole
    # payload can be read in as bytes
    raw_data <- readBin(tmp, "raw", file.info(tmp)$size)

    f <- file(ftemp, open = "wb")

    # chunk_size is a multiple of 4, so every chunk except possibly the
    # last is a whole number of base64 quanta and decodes independently
    for (i in seq(1, length(raw_data), by = chunk_size)) {
        raw_chunk <- raw_data[i:min(i + chunk_size - 1, length(raw_data))]

        data_chunk <- rawToChar(raw_chunk)
        # Pad the final chunk to a multiple of 4 characters if needed
        padding_length <- (4 - nchar(data_chunk) %% 4) %% 4
        data_chunk_padded <- paste0(data_chunk, strrep("=", padding_length))

        decoded_chunk <- base64decode(data_chunk_padded) # e.g. base64enc::base64decode

        writeBin(decoded_chunk, f)
    }

    close(f)
    unzip(ftemp)
}

# Use the function
decode_data_in_chunks(tmp)
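One caveat: readBin() above still loads the entire base64 payload into memory as a single raw vector. If that ever becomes a problem, the same idea works against a connection, reading and decoding one block at a time. This is only a sketch under the same assumptions (base64decode() as in base64enc, the payload is a zip file, and the base64 text contains no line breaks):

# Sketch: stream-decode the payload without holding it all in memory.
# Because chunk_size is a multiple of 4, each block read is a whole
# number of base64 quanta and can be decoded independently.
decode_data_streaming <- function(tmp, chunk_size = 32768) {
    ftemp <- tempfile(fileext = ".zip")
    con_in <- file(tmp, open = "rb")
    con_out <- file(ftemp, open = "wb")

    repeat {
        raw_chunk <- readBin(con_in, "raw", n = chunk_size)
        if (length(raw_chunk) == 0) break
        writeBin(base64enc::base64decode(rawToChar(raw_chunk)), con_out)
    }

    close(con_in)
    close(con_out)
    unzip(ftemp)
}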

Upvotes: 1

SamR

Reputation: 20240

I've had to do something similar in the past - here's a slightly adapted version.

Iterating over lines

The important thing is to get the number of lines in the file without reading the entire file into R. That way you know in advance how long your output list is going to be, so you don't face problems with growing a list. The way to do this differs depending on your OS.

get_n_lines <- function(file_path) {
    if (Sys.info()["sysname"] == "Windows") {
        powershell_output <- system2(
            "powershell",
            args = sprintf('Get-Content %s | Measure-Object -Line | Select-Object -ExpandProperty Lines', shQuote(file_path)),
            stdout = TRUE
        )
        return(as.integer(powershell_output))
    }

    # Otherwise Linux/Mac
    wc_output <- system2(
        "wc",
        args = sprintf("-l %s", shQuote(file_path)),
        stdout = TRUE
    )

    as.integer(strsplit(wc_output, "\\s+")[[c(1, 1)]])
}

Then the function to read the file in chunks is straightforward:

read_file_in_chunks <- function(file_path, chunk_size) {
    n_lines <- get_n_lines(file_path)
    n_chunks <- ceiling(n_lines / chunk_size)

    con <- file(file_path, "r")
    on.exit(close(con))

    lapply(seq(n_chunks), \(i) readLines(con, n = chunk_size))
}

To test it we can create a temporary text file with, for example, 903 lines:

sprintf("line_%s", seq(903)) |>
    writeLines("tmp.txt")

Then read it back in:

chunk_list <- read_file_in_chunks(
    file_path = "tmp.txt",
    chunk_size = 100
)
str(chunk_list)
# List of 10
#  $ : chr [1:100] "line_1" "line_2" "line_3" "line_4" ...
#  $ : chr [1:100] "line_101" "line_102" "line_103" "line_104" ...
#  $ : chr [1:100] "line_201" "line_202" "line_203" "line_204" ...
#  $ : chr [1:100] "line_301" "line_302" "line_303" "line_304" ...
#  $ : chr [1:100] "line_401" "line_402" "line_403" "line_404" ...
#  $ : chr [1:100] "line_501" "line_502" "line_503" "line_504" ...
#  $ : chr [1:100] "line_601" "line_602" "line_603" "line_604" ...
#  $ : chr [1:100] "line_701" "line_702" "line_703" "line_704" ...
#  $ : chr [1:100] "line_801" "line_802" "line_803" "line_804" ...
#  $ : chr [1:3] "line_901" "line_902" "line_903"

base64 encoded data

Note this also works if we convert the string to JSON and then base64 encode it. This is because, by default, base64 output wraps lines after 76 characters, so the encoded file still contains many lines.

# Create 903 line json string
sprintf("line_%s", seq(903)) |>
    jsonlite::toJSON() |>
    jsonlite::base64_enc() |>
    writeLines("tmp.txt")

chunk_list <- read_file_in_chunks(
    file_path = "tmp.txt",
    chunk_size = 100
)

lapply(
    chunk_list,
    \(x) jsonlite::base64_dec(x) |>
        intToUtf8()
) |>
    str()
# List of 2
#  $ : chr "[\"line_1\",\"line_2\",\"line_3\",\"line_4\",\"line_5\",\"line_6\",\"line_7\",\"line_8\",\"line_9\",\"line_10\""| __truncated__
#  $ : chr "01\",\"line_502\",\"line_503\",\"line_504\",\"line_505\",\"line_506\",\"line_507\",\"line_508\",\"line_509\",\""| __truncated__

However, the line breaks are not meaningful with this approach; you'll have to look at the structure of your JSON data to work out how to put the chunks back together.

Putting base64 encoded json back together

This will depend on the exact format of your data, but in the absence of that I will replicate a base64 encoded version of mtcars a thousand times as an example:

replicate(1e3, mtcars, simplify = FALSE) |>
    do.call(rbind, args = _) |>
    jsonlite::toJSON() |>
    jsonlite::base64_enc() |>
    writeLines("tmp.txt")

# Read it back in
chunk_list <- read_file_in_chunks("tmp.txt", 100)

This is basically a JSON array of rows in the following format:

[{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},
  {"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"}
  ...
]

We can write a function to parse these rows. The important thing is to pass the last, incomplete line of each chunk to the next one, where we prepend it.

parse_chunk <- function(chunk, prev_chunk_tail = NULL, last_chunk = FALSE) {
    if (is.null(prev_chunk_tail)) { # first chunk
        b64_to_parse <- chunk
        start_chr <- 2 # remove opening `[` in first chunk
    } else {
        b64_to_parse <- c(prev_chunk_tail, chunk)
        start_chr <- 1
    }

    txt_to_parse <- b64_to_parse |>
        jsonlite::base64_dec() |>
        intToUtf8()

    # In the last chunk remove the closing `]` and do not
    # cut off the final string
    if (last_chunk) {
        end_chr <- nchar(txt_to_parse) - 1
        head_n <- Inf
    } else {
        end_chr <- nchar(txt_to_parse)
        head_n <- -1
    }

    txt_split <- txt_to_parse |>
        substr(start_chr, end_chr) |>
        strsplit("(?<=}),(?={)", perl = TRUE) |>
        el(1)

    df <- txt_split |>
        head(head_n) |> # cut off last (incomplete) string
        lapply(\(str) jsonlite::fromJSON(str)) |>
        do.call(rbind, args = _)

    return(list(
        df = df,
        prev_chunk_tail = jsonlite::base64_enc(tail(txt_split, 1))
    ))
}
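As an aside, the lookaround pattern in strsplit() matches the commas between objects without consuming the braces on either side, so each element keeps its full {...} wrapper. A toy example:

strsplit('{"a":1},{"a":2},{"a":3}', "(?<=}),(?={)", perl = TRUE)[[1]]
# [1] "{\"a\":1}" "{\"a\":2}" "{\"a\":3}"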

Then loop over the chunks, passing the final incomplete line to the next chunk:

df_list <- vector(mode = "list", length = length(chunk_list))
for (i in seq(chunk_list)) {
    if (i == 1) {
        res <- parse_chunk(chunk_list[[i]])
    } else if (i == length(chunk_list)) {
        res <- parse_chunk(chunk_list[[i]], prev_chunk_tail = res$prev_chunk_tail, last_chunk = TRUE)
    } else {
        res <- parse_chunk(chunk_list[[i]], prev_chunk_tail = res$prev_chunk_tail)
    }

    df_list[[i]] <- res$df
}

This pieces the JSON chunks back together:

out_df  <- data.frame(do.call(rbind, df_list))
dim(out_df) # 32000, 12
head(out_df)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb             X_row
# 1   21   6  160 110  3.9  2.62 16.46  0  1    4    4         Mazda RX4
# 2   21   6  160 110  3.9 2.875 17.02  0  1    4    4     Mazda RX4 Wag
# 3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1        Datsun 710
# 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive
# 5 18.7   8  360 175 3.15  3.44 17.02  0  0    3    2 Hornet Sportabout
# 6 18.1   6  225 105 2.76  3.46 20.22  1  0    3    1           Valiant
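If you want a round-trip check, comparing against the replicated original is cheap (a sketch; the parsed columns may come back as list columns, hence the coercion):

# Optional sanity check against the replicated original
original <- do.call(rbind, replicate(1e3, mtcars, simplify = FALSE))
nrow(out_df) == nrow(original) # expect TRUE
all.equal(as.numeric(out_df$mpg), original$mpg, check.attributes = FALSE) # expect TRUE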

Upvotes: 1
