Igor

Reputation: 933

R - Read part of parquet files

Is there a way to read a specific number of rows from a parquet file? Something similar to the nrows argument of fread from data.table. I have a huge dataset that would take too long to read in full, but I just want to analyze its structure and integrity.

I need to read just some rows of my parquet data, and this does not seem to be possible with sparklyr's spark_read_parquet function.

Upvotes: 1

Views: 1426

Answers (1)

Jaime Caffarel

Reputation: 2469

Since the functions in the spark_read_xxx family return a Spark DataFrame, you can always filter and collect the results after reading the file, using the %>% operator. For instance, if you just wanted the first 2 rows of the file, you could do something like this:

DF <- spark_read_csv(sc, name = "mtcars", path = "R/mtcars.csv", header = FALSE, delimiter = ";")

DF %>% head(2) %>% dplyr::collect()
# A tibble: 2 x 12
             V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12
          <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <int>
1     Mazda RX4    21     6   160   110   3,9  2,62 16,46     0     1     4     4
2 Mazda RX4 Wag    21     6   160   110   3,9 2,875 17,02     0     1     4     4

I'm using the spark_read_csv function here, but the result should be the same with spark_read_parquet since both functions return the same structure.
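
For the parquet case in the question, a minimal sketch might look like the following. The local connection, the table name "my_data" and the path "path/to/data.parquet" are placeholders I made up, not values from the question. I'm also assuming that passing memory = FALSE is what you want here: as far as I understand the sparklyr documentation, the default memory = TRUE caches the whole table when it is read, which is likely the slow part with a huge file.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; replace with your own cluster settings
sc <- spark_connect(master = "local")

# memory = FALSE registers the table without eagerly caching it in Spark memory
# "my_data" and the path are hypothetical placeholders
DF <- spark_read_parquet(sc, name = "my_data",
                         path = "path/to/data.parquet",
                         memory = FALSE)

# Bring only the first 2 rows back into R
DF %>% head(2) %>% collect()

# Inspect column names and types without collecting any rows
sdf_schema(DF)

spark_disconnect(sc)
```

With this approach only the rows requested by head() are pulled into the R session, and sdf_schema() gives a quick view of the structure.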

Upvotes: 1
