Reputation: 933
Is there a way to read a specific number of lines from a parquet file? Something similar to the nrows argument of fread from data.table. I have a huge dataset that would take too long to read in full, but I just want to analyze its structure and integrity. I need to read just a few rows of my parquet data, and that doesn't seem possible with sparklyr's spark_read_parquet function.
Upvotes: 1
Views: 1426
Reputation: 2469
Since the spark_read_xxx family of functions returns a Spark DataFrame, you can always filter and collect the result after reading the file, using the %>% operator. For instance, if you just wanted the first 2 rows of the file, you could do something like this:
DF <- spark_read_csv(sc, name = "mtcars", path = "R/mtcars.csv", header = FALSE, delimiter = ";")
DF %>% head(2) %>% dplyr::collect()
# A tibble: 2 x 12
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
<chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <int>
1 Mazda RX4 21 6 160 110 3,9 2,62 16,46 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3,9 2,875 17,02 0 1 4 4
I'm using the spark_read_csv function here, but the result should be the same with spark_read_parquet, since both functions return the same kind of object.
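For parquet specifically, the same pattern looks like this. This is a minimal sketch: the file path and table name are hypothetical, and a local Spark connection is assumed.

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation; adjust master for your cluster.
sc <- spark_connect(master = "local")

# spark_read_parquet returns a lazy Spark DataFrame; nothing is pulled
# into the R session yet. The path below is a hypothetical example.
DF <- spark_read_parquet(sc, name = "my_data", path = "data/my_data.parquet")

# head() is translated to a LIMIT on the Spark side, so only the first
# rows are transferred to R when collect() runs.
DF %>% head(2) %>% collect()

spark_disconnect(sc)
```

Because the limit is applied before collect(), Spark never materializes the full table in R, which makes this cheap enough for a quick structure check even on large files.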
Upvotes: 1