Kate

Reputation: 33

How to Open "GZ FILE" using sparklyr in R?

I'd like to open a gz file using the sparklyr package, since I'm using Spark in R. I know I can use read.delim2(gzfile("filename.csv.gz"), sep = ",", header = FALSE) to open a gz file, and I can use spark_read_csv to open a csv file, but neither works when I try to open the gz file in Spark. Please help!

Upvotes: 2

Views: 734

Answers (1)

zero323

Reputation: 330203

Default Spark readers can load gzipped data transparently, without any additional configuration, as long as the file has the proper extension indicating the compression used.
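The examples below assume an existing Spark connection sc; a minimal sketch of creating one, assuming a local Spark installation, would be:

library(sparklyr)
sc <- spark_connect(master = "local")  # local mode; assumes Spark is installed locally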

So if you have a gzipped file (note that such a setup will work only in local mode; in distributed mode you need shared storage) like this:

valid_path <- tempfile(fileext=".csv.gz")
valid_conn <- gzfile(valid_path, "w")
readr::write_csv(iris, valid_conn)
close(valid_conn)

spark_read_csv will work just fine:

spark_read_csv(sc, "valid", valid_path)
# Source: spark<valid> [?? x 5]
   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 

However, this

invalid_path <- tempfile(fileext=".csv")
invalid_conn <- gzfile(invalid_path, "w")
readr::write_csv(iris, invalid_conn)
close(invalid_conn)

won't, as Spark will read the data as-is:

spark_read_csv(sc, "invalid", invalid_path)

Also please keep in mind that gzip is not splittable, and as such is a poor choice for distributed applications. So if the file is large, it typically makes sense to unpack it with standard system tools before you proceed with Spark.
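For example, a large file could be decompressed up front with the system gzip tool and the plain CSV loaded instead (the path here is hypothetical, and -k, which keeps the original file, assumes a reasonably recent gzip):

system("gunzip -k /path/to/large_file.csv.gz")  # decompress outside Spark
spark_read_csv(sc, "large_file", "/path/to/large_file.csv")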

Upvotes: 2
