Reputation: 11317
Have some XML and regular text files that are north of 2 gigs. Loading the entire file into memory every time I want to try something out in Spark takes too long on my machine.
Is there a way to read only a portion of the file (similar to running a SQL command against a large table and only getting a few rows without it taking forever)?
Upvotes: 4
Views: 5114
Reputation: 28392
You can restrict the number of rows to n while reading a file by using limit(n).
For CSV files this can be done as:
spark.read.csv("/path/to/file/").limit(n)
and for text files as:
spark.read.text("/path/to/file/").limit(n)
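As a concrete illustration, here is a minimal PySpark sketch; the session name, file path, and n=1000 are placeholders, not values from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-read").getOrCreate()

# Read only the first 1000 rows of a large text file; with limit(),
# Spark plans a CollectLimit rather than materializing the whole file.
sample_df = spark.read.text("/path/to/large_file.txt").limit(1000)
sample_df.show(5, truncate=False)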
Running explain on the resulting DataFrames shows that the whole file is not loaded; here with n=3 on a CSV file:
== Physical Plan ==
CollectLimit 3
...
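For reference, a sketch of how such a plan can be produced (the path is a placeholder):
df = spark.read.csv("/path/to/file.csv").limit(3)
df.explain()  # prints the physical plan, starting with CollectLimit 3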
Upvotes: 4