Reputation: 31
I am working on some R code, trying to read the content of a file through an API call. The content is base64 encoded and the file itself is over 2 GB in size.
I have tried a few approaches so far: downloading the file into memory and writing it to disk. In both cases, when I try to decode the file it fails with:
Error in readLines: R character strings are limited to 2^31-1 bytes
Has anyone faced this in the past, and does anyone know how to work around it? At first I tried the call with httr, then switched to httr2:
library(httr2)

req <- request(test_datapull_API) |>
  req_headers('Content-Type' = 'application/json', 'Cookie' = sprintf('token=%s; username=%s', token, input$user_id)) |>
  req_body_json(call_body)
tmp <- tempfile()

req_perform(
  req,
  path = tmp
)
stringvalue <- readLines(tmp) #this is where it fails with the mentioned error
# decode the base64 string to binary (raw) data
b64_result <- base64_decode(stringvalue)
I am running this in Docker with more than 50 GB of memory available and no limit per container, so I don't believe this is an R memory issue.
Anyone?
I tried the download and decoding with the httr, httr2, base64 and openssl libraries.
Upvotes: 2
Views: 617
Reputation: 31
This is the solution that worked for me:
library(base64enc) # provides base64decode(); adjust if you use another base64 package

# Function to decode the base64 text in chunks, so no single R string
# ever exceeds the 2^31-1 byte limit
decode_data_in_chunks <- function(tmp, chunk_size = 32768) {
  stopifnot(chunk_size %% 4 == 0) # keep chunks aligned to 4-character base64 groups
  ftemp <- tempfile(fileext = ".zip")
  raw_data <- readBin(tmp, "raw", file.info(tmp)$size)
  f <- file(ftemp, open = "wb")
  for (i in seq(1, length(raw_data), by = chunk_size)) {
    raw_chunk <- raw_data[i:min(i + chunk_size - 1, length(raw_data))]
    data_chunk <- rawToChar(raw_chunk)
    # only the final chunk can need "=" padding to a multiple of 4 characters
    padding_length <- (4 - nchar(data_chunk) %% 4) %% 4
    data_chunk_padded <- paste0(data_chunk, strrep("=", padding_length))
    decoded_chunk <- base64decode(data_chunk_padded)
    writeBin(decoded_chunk, f)
  }
  close(f)
  unzip(ftemp)
}
# Use the function on the file downloaded with req_perform()
decode_data_in_chunks(tmp)
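One caveat with the approach above: readBin() still loads the whole raw file into memory before decoding. A minimal streaming sketch, assuming the base64 payload contains no line breaks and again using base64enc::base64decode() (decode_stream() is just an illustrative name), reads and decodes directly from a connection instead:

library(base64enc)

# Streaming variant: read the base64 text in fixed-size chunks from a connection,
# decode each chunk and append the binary result to the output file.
decode_stream <- function(in_path, out_path, chunk_size = 32768) {
  stopifnot(chunk_size %% 4 == 0) # keep chunks aligned to 4-character base64 groups
  con_in <- file(in_path, open = "rb")
  con_out <- file(out_path, open = "wb")
  on.exit({ close(con_in); close(con_out) }, add = TRUE)
  repeat {
    raw_chunk <- readBin(con_in, "raw", n = chunk_size)
    if (length(raw_chunk) == 0) break
    writeBin(base64decode(rawToChar(raw_chunk)), con_out)
  }
  invisible(out_path)
}

# Usage, assuming `tmp` is the file written by req_perform()
decode_stream(tmp, tempfile(fileext = ".zip"))

This keeps peak memory at roughly one chunk rather than the full 2 GB of base64 text.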
Upvotes: 1
Reputation: 20240
I've had to do something similar in the past - here's a slightly adapted version.
The important thing is to get the number of lines in the file without reading the whole file into R. This means you know in advance how long your output list is going to be, so you don't face problems with growing a list. The way to do this differs depending on your OS.
get_n_lines <- function(file_path) {
  if (Sys.info()["sysname"] == "Windows") {
    powershell_output <- system2(
      "powershell",
      # the path is already quoted inside the template, so don't shQuote() it as well
      args = sprintf('Get-Content "%s" | Measure-Object -Line | Select-Object -ExpandProperty Lines', file_path),
      stdout = TRUE
    )
    return(as.integer(powershell_output))
  }
  # Otherwise Linux/Mac
  wc_output <- system2(
    "wc",
    args = sprintf("-l %s", shQuote(file_path)),
    stdout = TRUE
  )
  # wc prints e.g. "903 tmp.txt" (with leading spaces on macOS); take the first field
  as.integer(strsplit(trimws(wc_output), "\\s+")[[c(1, 1)]])
}
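If shelling out is not an option, a rough cross-platform fallback is to count newline bytes in fixed-size chunks, which never builds a long string. This is only a sketch, assuming you can afford a single pass over the file in R; count_lines_in_r() is just an illustrative name:

count_lines_in_r <- function(file_path, chunk_size = 1e6) {
  con <- file(file_path, open = "rb")
  on.exit(close(con))
  n <- 0L
  repeat {
    bytes <- readBin(con, "raw", n = chunk_size)
    if (length(bytes) == 0) break
    n <- n + sum(bytes == as.raw(10L)) # 0x0a is "\n"
  }
  n
}

Like wc -l, this counts newline characters, so if the file has no trailing newline the last line is not counted.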
Then the function to read the file in chunks is straightforward:
read_file_in_chunks <- function(file_path, chunk_size) {
  n_lines <- get_n_lines(file_path)
  n_chunks <- ceiling(n_lines / chunk_size)
  con <- file(file_path, "r")
  on.exit(close(con))
  lapply(seq(n_chunks), \(i) readLines(con, n = chunk_size))
}
To test it we can create a temporary text file with, for example, 903 lines:
sprintf("line_%s", seq(903)) |>
writeLines("tmp.txt")
Then read it back in:
chunk_list <- read_file_in_chunks(
  file_path = "tmp.txt",
  chunk_size = 100
)
str(chunk_list)
# List of 10
# $ : chr [1:100] "line_1" "line_2" "line_3" "line_4" ...
# $ : chr [1:100] "line_101" "line_102" "line_103" "line_104" ...
# $ : chr [1:100] "line_201" "line_202" "line_203" "line_204" ...
# $ : chr [1:100] "line_301" "line_302" "line_303" "line_304" ...
# $ : chr [1:100] "line_401" "line_402" "line_403" "line_404" ...
# $ : chr [1:100] "line_501" "line_502" "line_503" "line_504" ...
# $ : chr [1:100] "line_601" "line_602" "line_603" "line_604" ...
# $ : chr [1:100] "line_701" "line_702" "line_703" "line_704" ...
# $ : chr [1:100] "line_801" "line_802" "line_803" "line_804" ...
# $ : chr [1:3] "line_901" "line_902" "line_903"
Note this also works if we convert the string to JSON and then base64-encode it. This is because, by default, base64 wraps lines after 76 characters.
# Create a 903-line json string, base64 encode it and write it to file
sprintf("line_%s", seq(903)) |>
  jsonlite::toJSON() |>
  jsonlite::base64_enc() |>
  writeLines("tmp.txt")
chunk_list <- read_file_in_chunks(
  file_path = "tmp.txt",
  chunk_size = 100
)

lapply(
  chunk_list,
  \(x) jsonlite::base64_dec(x) |>
    intToUtf8()
) |>
  str()
# List of 2
# $ : chr "[\"line_1\",\"line_2\",\"line_3\",\"line_4\",\"line_5\",\"line_6\",\"line_7\",\"line_8\",\"line_9\",\"line_10\""| __truncated__
# $ : chr "01\",\"line_502\",\"line_503\",\"line_504\",\"line_505\",\"line_506\",\"line_507\",\"line_508\",\"line_509\",\""| __truncated__
However, the line breaks are not meaningful with this approach; you'll have to look at the structure of your JSON data to work out how to piece the chunks back together.
This will depend on the exact format of your data, but in the absence of that I will replicate a base64-encoded version of mtcars a thousand times as an example:
replicate(1e3, mtcars, simplify = FALSE) |>
  do.call(rbind, args = _) |>
  jsonlite::toJSON() |>
  jsonlite::base64_enc() |>
  writeLines("tmp.txt")

# Read it back in
chunk_list <- read_file_in_chunks("tmp.txt", 100)
This is basically a JSON array of rows in the following format:
[{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"}
...
]
We can write a function to parse these rows. The important thing is to pass the last, incomplete row of each chunk on to the next chunk, where we prepend it.
parse_chunk <- function(chunk, prev_chunk_tail = NULL, last_chunk = FALSE) {
  if (is.null(prev_chunk_tail)) { # first chunk
    hex_to_parse <- chunk
    start_chr <- 2 # remove opening `[` in first chunk
  } else {
    hex_to_parse <- c(prev_chunk_tail, chunk)
    start_chr <- 1
  }
  txt_to_parse <- hex_to_parse |>
    jsonlite::base64_dec() |>
    intToUtf8()
  # In the last chunk remove the closing `]` and do not
  # cut off the final string
  if (last_chunk) {
    end_chr <- nchar(txt_to_parse) - 1
    head_n <- Inf
  } else {
    end_chr <- nchar(txt_to_parse)
    head_n <- -1
  }
  txt_split <- txt_to_parse |>
    substr(start_chr, end_chr) |>
    strsplit("(?<=}),(?={)", perl = TRUE) |>
    el(1) # el() from the methods package: take the first list element
  df <- txt_split |>
    head(head_n) |> # cut off last (incomplete) string
    lapply(\(str) jsonlite::fromJSON(str)) |>
    do.call(rbind, args = _)
  return(list(
    df = df,
    # re-encode the (possibly incomplete) last row so it can be prepended to the next chunk
    prev_chunk_tail = jsonlite::base64_enc(tail(txt_split, 1))
  ))
}
Then loop over the chunks, passing the final incomplete row to the next chunk:
df_list <- vector(mode = "list", length = length(chunk_list))

for (i in seq_along(chunk_list)) {
  if (i == 1) {
    res <- parse_chunk(chunk_list[[i]])
  } else if (i == length(chunk_list)) {
    res <- parse_chunk(chunk_list[[i]], prev_chunk_tail = res$prev_chunk_tail, last_chunk = TRUE)
  } else {
    res <- parse_chunk(chunk_list[[i]], prev_chunk_tail = res$prev_chunk_tail)
  }
  df_list[[i]] <- res$df
}
This pieces the JSON chunks back together:
out_df <- data.frame(do.call(rbind, df_list))
dim(out_df) # 32000, 12
head(out_df)
# mpg cyl disp hp drat wt qsec vs am gear carb X_row
# 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4 Mazda RX4
# 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4 Mazda RX4 Wag
# 3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 Datsun 710
# 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive
# 5 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 Hornet Sportabout
# 6 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 Valiant
Upvotes: 1