Gerg

Reputation: 336

How to read a parquet file in R without using spark packages?

I could find many answers online that use sparklyr or other Spark packages, all of which require spinning up a Spark cluster, which is an overhead. In Python I can do this with "pandas.read_parquet" or with Apache Arrow; I am looking for something similar in R.

Upvotes: 15

Views: 5837

Answers (2)

fc9.30

Reputation: 2571

You can simply use the arrow package:

install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
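If you only need some of the columns, `arrow::read_parquet()` also accepts a `col_select` argument, so you can avoid reading the whole file into memory. A minimal sketch (the file name and column names are placeholders):

```r
library(arrow)

# Read only the named columns; col_select accepts tidyselect-style selections
df <- read_parquet("myfile.parquet", col_select = c("id", "value"))
```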

Upvotes: 4

Jonathan

Reputation: 641

With reticulate you can use pandas from Python to read parquet files. This saves you the hassle of running a Spark instance, though you may lose some performance to serialization until Apache Arrow releases their R version, as the comment above mentioned.

library(reticulate)
library(dplyr)

# Import pandas via reticulate (requires a Python installation with pandas)
pandas <- import("pandas")

read_parquet <- function(path, columns = NULL) {
  # Expand "~" and resolve to an absolute path before passing it to Python
  path <- path.expand(path)
  path <- normalizePath(path)

  # pandas expects a Python list of column names, not an R vector
  if (!is.null(columns)) columns <- as.list(columns)

  xdf <- pandas$read_parquet(path, columns = columns)

  # Convert the pandas DataFrame to an R data frame, then to a tibble
  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)
  dplyr::as_tibble(xdf)
}

read_parquet(PATH_TO_PARQUET_FILE)

Upvotes: 2
