Thom Rogers

Reputation: 1433

Pyspark - error: "index out of range" on .count()

I'm using Spark 1.5, and I keep getting "index out of range" whenever I invoke .count(), .top(), or .take(x):

lateWest = westBound.filter(lambda line: line.split(',')[16] > 0)
print(type(lateWest))
# <class 'pyspark.rdd.PipelinedRDD'>
lateWest.count()

lateWest.first()

lateWest.take(3)

Any ideas why I am getting this error? I'm guessing it's because lateWest is empty as a result of the first command. But how can I check whether it is empty?

Upvotes: 0

Views: 2803

Answers (1)

Charlie Haley

Reputation: 4310

Spark operates using a concept called lazy evaluation. So when you run the first line, the system doesn't actually run your lambda function; it just stores it in a Spark object. When you invoke count(), Spark finally runs the lambda inside your filter, and that's where the error actually occurs. In other words, the error is telling you that at least one input line doesn't have 16 commas, so line.split(',') produces fewer than 17 fields and index 16 is out of range.
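
As a minimal sketch of both fixes (the is_late helper name is just illustrative, and I'm assuming field 16 is numeric, hence the float() conversion), you could guard the index before using it, and answer your "is it empty?" question with isEmpty(), which RDDs have had since Spark 1.3:

def is_late(line):
    fields = line.split(',')
    # Index 16 only exists when the line has at least 17 fields,
    # i.e. at least 16 commas; skip malformed lines instead of crashing.
    return len(fields) > 16 and float(fields[16]) > 0

lateWest = westBound.filter(is_late)

# isEmpty() checks for emptiness without materializing a full count.
if lateWest.isEmpty():
    print("lateWest is empty")
else:
    print(lateWest.count())

As a side note, the float() conversion also addresses a second issue: the original lambda compares the raw string from split() against the integer 0, which is not a numeric comparison.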

Upvotes: 1
