thisisshantzz

Reputation: 1097

How do I get a subset of a CSV file as a Spark RDD?

I am new to Spark and am trying to read a CSV file and get its first and second columns. The file is huge, though, and I am not interested in parsing every line in it. Also, calling collect() might crash the process, because there may not be enough memory to hold the data it returns. So I was wondering whether it is possible to create an RDD with only a subset of the CSV data. For example, is it possible to generate an RDD containing lines 10 to 1000 of the file and ignore the other lines?

Right now, all I have is

csvdata = sc.textFile("hdfs://nn:port/datasets/sample.csv").map(lambda line: line.split(","))

This basically creates an RDD for the entire CSV file. Is it possible to create an RDD from csvdata containing only lines 10 to 1000?

Thanks a lot for the help offered.

Upvotes: 2

Views: 1207

Answers (2)

burythehammer

Reputation: 379

An RDD isn't data stored in memory; it's a description of work to be done on some data. Only when you call an action (a terminal operation such as collect or reduce) does Spark actually process the data. Because evaluation is lazy, Spark can limit the amount of work it has to do, based on the chain of operations you have applied to the RDD.

(Try it yourself: call some transformations on an RDD without calling an action. Nothing happens!)
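As a rough plain-Python analogy (this is not Spark code, and the helper name split_line is made up for illustration), generators behave in a similar lazy way: building the pipeline does no work, and consuming only a few results processes only a few lines.

```python
from itertools import islice

processed = []  # records which lines were actually split


def split_line(line):
    """Split one CSV line, remembering that it was touched."""
    processed.append(line)
    return line.split(",")


# Building the pipeline does no work yet (similar to Spark transformations).
lines = (f"{i},{i * 2}" for i in range(1_000_000))
rows = map(split_line, lines)
nothing_done_yet = (len(processed) == 0)

# Consuming three results triggers the work (similar to a Spark action).
first_three = list(islice(rows, 3))
```

After this runs, processed holds only the three lines that were actually requested, even though the source could have produced a million.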

So you can do e.g. (this is Scala, but the Python version is not too different)

val first10results: Array[Array[String]] = sc.textFile(filePath)
      .map(f => f.split(","))
      .take(10)

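Because of take(10), Spark knows that you only want the first 10 rows, so it reads only as much of the file as it needs to produce them (typically just the start of the first partition) instead of parsing the whole file. Simple.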
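For reference, a PySpark version of the same idea might look like the sketch below. The function name first_rows and its parameters are hypothetical; sc is assumed to be an existing SparkContext, as in the question.

```python
def first_rows(sc, path, n=10):
    """Return the first n rows of a CSV file as lists of fields.

    sc is assumed to be an existing SparkContext; path is any URI
    that sc.textFile accepts.
    """
    return (
        sc.textFile(path)
        .map(lambda line: line.split(","))  # naive split, ignores quoted commas
        .take(n)  # action: returns a plain Python list to the driver
    )
```

With the file from the question, this could be called as first_rows(sc, "hdfs://nn:port/datasets/sample.csv"). Note that take returns a local list rather than an RDD, so it only suits small n.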

Upvotes: 0

zero323

Reputation: 330283

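You can load everything and filter by index:

# Stand-in data for the example; use your own RDD (e.g. csvdata) instead
rdd = sc.parallelize(range(0, -10000, -1))
# Keep elements whose 0-based index is in [9, 999), then drop the index
rdd.zipWithIndex().filter(lambda kv: 9 <= kv[1] < 999).keys()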

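Adjust the range depending on how you define the 10th line. Indices from zipWithIndex start at 0, so as written the filter keeps the 10th through 999th elements; use < 1000 as the upper bound to include the 1000th as well.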
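Applied to the CSV file from the question, the same approach could be sketched as follows. The function name csv_line_range and its parameters are hypothetical, sc is assumed to be an existing SparkContext, and line numbers are assumed to be 1-based and inclusive.

```python
def csv_line_range(sc, path, first_line=10, last_line=1000):
    """Return an RDD of [col1, col2] for lines first_line..last_line.

    Line numbers are 1-based and inclusive (an assumption); sc is an
    existing SparkContext.
    """
    lo, hi = first_line - 1, last_line  # 0-based, half-open index range
    return (
        sc.textFile(path)
        .zipWithIndex()                       # (line, index) pairs
        .filter(lambda kv: lo <= kv[1] < hi)  # keep the wanted lines
        .keys()                               # drop the index again
        .map(lambda line: line.split(",")[:2])  # naive split, first two fields
    )
```

Only the selected lines are split, and the result stays an RDD, so nothing is pulled back to the driver until you call an action. Spark still has to read the whole file to assign the indices, though, and zipWithIndex itself runs a job when the RDD has more than one partition.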

Upvotes: 2
