Raja Saha

Reputation: 509

How to read a Parquet file as an R data.frame without any other dependencies (like Spark, Python, etc.)?

I need to read some Parquet files in R. There are a few solutions using

  1. sparklyr::spark_read_parquet (which requires Spark)
  2. reticulate (which needs Python)

The problem is that I am not allowed to install any tool other than R. Is there any package available in R which can read Parquet files without using any other tool?

Upvotes: 5

Views: 5547

Answers (2)

Gabor Csardi

Reputation: 10825

Five years later, but perhaps worth noting that you can now read and write (flat) Parquet files with the nanoparquet R package, which is pretty small and easy to install:

install.packages("nanoparquet")
library(nanoparquet)
write_parquet(mtcars, "mtcars.parquet")
read_parquet("mtcars.parquet")

See more at https://r-lib.github.io/nanoparquet/
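If, as in the question, you specifically need a plain base R data.frame, a minimal round-trip sketch using only the two functions shown above looks like this (example.parquet is just a hypothetical file name):

library(nanoparquet)

# write an ordinary data.frame to a Parquet file
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
write_parquet(df, "example.parquet")

# read it back and coerce to a plain base data.frame
df2 <- as.data.frame(read_parquet("example.parquet"))
all.equal(df, df2, check.attributes = FALSE)  # should be TRUE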

Upvotes: 1

Uwe L. Korn

Reputation: 8796

You can use arrow for this (the same library as pyarrow in Python), which nowadays is also packaged for R (without needing Python). As it is not yet available on CRAN, you have to build and install the Arrow C++ library manually first:

git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install

Then you can install the R arrow package:

devtools::install_github("apache/arrow/r")

And use it to load a Parquet file:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     array, table
read_parquet("somefile.parquet", as_tibble = TRUE)
#> # A tibble: 10 x 2
#>        x       y
#>    <int>   <dbl>
#> …
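
If you specifically need a base R data.frame rather than a tibble (as the question asks), you can coerce the result with as.data.frame():

library(arrow)
df <- as.data.frame(read_parquet("somefile.parquet", as_tibble = TRUE))
class(df)
#> [1] "data.frame"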

Edit (22/9/2019)

It is now available on CRAN; install it with install.packages("arrow").
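
With the CRAN release the whole workflow simplifies to something like this (a sketch; note that in newer versions of the arrow package the as_tibble argument used above has been renamed, so check ?read_parquet for the current argument name):

install.packages("arrow")
library(arrow)

# read the Parquet file; by default the result is a data frame (tibble)
df <- read_parquet("somefile.parquet")
head(df)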

Upvotes: 6
