Thom Rogers

Reputation: 1433

Pyspark - error: "index out of range" on .count()

I'm using Spark 1.5, and I keep getting "index out of range" whenever I invoke .count(), .top(), or .take(x):

lateWest = westBound.filter(lambda line: line.split(',')[16] > 0)
print(type(lateWest))
# <class 'pyspark.rdd.PipelinedRDD'>
lateWest.count()

lateWest.first()

lateWest.take(3)

Any ideas why I am getting this error? I'm guessing it's because lateWest is empty as a result of the first command. But how can I check whether it is empty?

Upvotes: 0

Views: 2803

Answers (1)

Charlie Haley

Reputation: 4310

Spark operates using a concept called lazy evaluation. So when you run the first line, the system doesn't actually run your lambda function; it just stores it in a Spark object. When you invoke count(), Spark finally runs the lambda inside your filter, and that's where the error actually occurs. In other words, the error is telling you that at least one input line doesn't have 16 commas, so line.split(',') produces fewer than 17 fields and index 16 is out of range.
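
As a minimal sketch of both fixes (the is_late helper name is just illustrative, and I'm assuming field 16 is numeric, hence the float() conversion), you could guard the index before using it, and answer your "is it empty?" question with isEmpty(), which RDDs have had since Spark 1.3:

def is_late(line):
    fields = line.split(',')
    # Index 16 only exists when the line has at least 17 fields,
    # i.e. at least 16 commas; skip malformed lines instead of crashing.
    return len(fields) > 16 and float(fields[16]) > 0

lateWest = westBound.filter(is_late)

# isEmpty() checks for emptiness without materializing a full count.
if lateWest.isEmpty():
    print("lateWest is empty")
else:
    print(lateWest.count())

As a side note, the float() conversion also addresses a second issue: the original lambda compares the raw string from split() against the integer 0, which is not a numeric comparison.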

Upvotes: 1
