Reputation: 11317
Have some XML and regular text files that are north of 2 gigs. Loading the entire file into memory every time I want to try something out in Spark takes too long on my machine.
Is there a way to read only a portion of the file (similar to running a SQL command against a large table and only getting a few rows without it taking forever)?
Upvotes: 4
Views: 5114
Reputation: 28392
You can restrict the number of rows to n while reading a file by using limit(n).
For CSV files this can be done as:
spark.read.csv("/path/to/file/").limit(n)
and for text files as:
spark.read.text("/path/to/file/").limit(n)
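As a concrete illustration, here is a minimal PySpark sketch; the session name, file path, and n=1000 are placeholders, not values from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-read").getOrCreate()

# Read only the first 1000 rows of a large text file; with limit(),
# Spark plans a CollectLimit rather than materializing the whole file.
sample_df = spark.read.text("/path/to/large_file.txt").limit(1000)
sample_df.show(5, truncate=False)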
Running explain on the resulting DataFrames shows that the whole file is not loaded; here with n=3 on a CSV file:
== Physical Plan ==
CollectLimit 3
...
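For reference, a sketch of how such a plan can be produced (the path is a placeholder):
df = spark.read.csv("/path/to/file.csv").limit(3)
df.explain()  # prints the physical plan, starting with CollectLimit 3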
Upvotes: 4