Reputation: 1204
I realise parquet is a columnar format, but with large files you sometimes don't want to read everything into memory in R before filtering, and the first 1000 or so rows may be enough for testing. I don't see an option for this in the read_parquet documentation here.
I see a solution for pandas here and an option for C# here, but it is not obvious to me how either would translate to R. Suggestions?
Upvotes: 5
Views: 4037
Reputation: 1204
Thanks to Jon and Dan for pointing in the right direction.
arrow::open_dataset() allows lazy evaluation (docs here): it returns a lazy object that you can call head() or filter() on (but not slice()) before the data is read into memory. This is faster and uses much less peak RAM than reading the whole file. Example below.
# https://stackoverflow.com/questions/73131505/r-reading-first-n-rows-from-parquet-file
library(dplyr)
library(arrow)
library(tictoc) # optional, used to time results

tic("read all of large parquet file")
my_animals <- read_parquet("data/my_animals.parquet")
toc() # slow and uses heaps of RAM

tic("read parquet and write mini version")
my_animals <- open_dataset("data/my_animals.parquet")
my_animals # this is a lazy object
my_animals %>%
  # slice(1000L) %>% # doesn't work
  head(n = 1000L) %>%
  # filter(YEAROFBIRTH >= 2010) %>% # also works
  compute() %>%
  write_parquet("data/my_animals_mini.parquet") # optional
toc() # much faster, much less peak RAM used
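If you don't have a large file to hand, the same pattern can be tried on throwaway data. This is only a minimal sketch: the demo data frame, its column names and the temporary file path are made up for illustration, and it assumes a reasonably recent version of {arrow} is installed.

```r
library(arrow)
library(dplyr)

# Hypothetical demo data: a 10,000-row data frame written to a temporary file.
demo_path <- tempfile(fileext = ".parquet")
demo_data <- data.frame(
  id = 1:10000,
  year_of_birth = rep(2001:2020, each = 500)
)
write_parquet(demo_data, demo_path)

# open_dataset() returns a lazy object; rows are only pulled into R by collect().
first_rows <- open_dataset(demo_path) %>%
  head(n = 1000L) %>%
  collect()

# filter() can be pushed down to the lazy object in the same way.
recent_rows <- open_dataset(demo_path) %>%
  filter(year_of_birth >= 2010) %>%
  collect()

nrow(first_rows)
nrow(recent_rows)
```

Using collect() instead of compute() here brings the result back as an ordinary data frame, which is usually what you want for quick testing.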
Upvotes: 6
Reputation: 1011
You can set the as_data_frame argument of read_parquet to FALSE to return the data as an Arrow Table object instead of a data frame. You can then use {dplyr} functions on this object, followed by dplyr::collect (collect will return a tibble, whereas compute merely forces the computation and returns an Arrow Table).
library(dplyr)
library(arrow)

my_animals <- read_parquet("data/my_animals.parquet", as_data_frame = FALSE) |>
  slice_head(n = 1000) |>
  collect()
This is readable, fast and memory efficient!
See https://arrow.apache.org/docs/r/articles/data_wrangling.html for more info.
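To make the collect versus compute distinction concrete, here is a small sketch on an in-memory Arrow Table. The table contents are hypothetical, and it assumes a version of {arrow} that provides arrow_table() and R 4.1 or later for the native pipe.

```r
library(arrow)
library(dplyr)

# Hypothetical in-memory Arrow Table standing in for a file read with
# as_data_frame = FALSE.
demo_table <- arrow_table(data.frame(id = 1:5000))

# compute() runs the query but keeps the result in Arrow memory.
computed <- demo_table |>
  filter(id <= 1000) |>
  compute()

# collect() runs the query and converts the result to an R data frame.
collected <- demo_table |>
  filter(id <= 1000) |>
  collect()

inherits(computed, "Table")
inherits(collected, "data.frame")
```

In practice, compute() is handy when the next step is another Arrow operation such as write_parquet(), and collect() when you want to carry on in plain R.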
Upvotes: 2
Reputation: 37
I published a simple package for this kind of practical use: https://github.com/mkparkin/Rinvent. Feel free to check whether it helps. It has a parameter called "sample" that returns a sample of rows, and it can also read "delta" files.
readparquetR(pathtoread="C:/users/...", format="delta", sample=10)
or
readparquetR(pathtoread="C:/users/...", format="parquet", sample=10)
Upvotes: 0