Reputation: 30815
So I'm following the Spark tutorial using Scala, working with the page-count dataset from Wikimedia. I want to generate a histogram of total page views by language. The first column is the language and the third column is the page views. However, it seems that some lines in the dataset have no third column, because I get an ArrayIndexOutOfBoundsException when I run the following code.
scala> val tuples = pagecounts.map(line => line.split(" "))
scala> val keyValuePairs = tuples.map(line => (line(0).substring(0, 2), line(2).toInt))
scala> keyValuePairs.reduceByKey(_+_, 1).collect
Does anyone have an idea how to ignore the lines that are missing the third column, so that I can run the query against only the lines that contain it?
Upvotes: 0
Views: 250
Reputation: 8227
You want to filter the page counts so that only the lines with at least 3 fields are operated on. Split each line first, then use filter on the resulting arrays (not on the strings inside each array) to select just those:
val tuples = pagecounts.map(line => line.split(" ")).filter(_.length >= 3)
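If you want to be a bit more defensive, here is a rough sketch of the same idea on a plain Scala Seq standing in for the RDD, so it runs without a cluster. The sample lines and the helper name parseLine are made up for illustration. The extra checks (a language code shorter than 2 characters, a non-numeric view count) are assumptions about what else might be malformed, not something the question confirms.

```scala
import scala.util.Try

// Hypothetical sample lines standing in for the pagecounts RDD (made-up data).
val sampleLines = Seq(
  "en Main_Page 10 1000",
  "en Foo 5 200",
  "fr Accueil 7 300",
  "de Broken",      // missing the page-view field
  "x Short 3 50",   // language code shorter than 2 characters
  "en Bad abc 10"   // page-view field is not a number
)

// Parse one line into (language, views), or None if the line is malformed.
def parseLine(line: String): Option[(String, Int)] = {
  val fields = line.split(" ")
  if (fields.length >= 3 && fields(0).length >= 2)
    Try(fields(2).toInt).toOption.map(views => (fields(0).substring(0, 2), views))
  else None
}

// flatMap drops the None results, so only well-formed lines are summed.
// groupBy + sum plays the role of reduceByKey(_ + _) on a local collection.
val totals: Map[String, Int] =
  sampleLines
    .flatMap(line => parseLine(line))
    .groupBy(_._1)
    .map { case (lang, pairs) => (lang, pairs.map(_._2).sum) }
```

On the actual RDD the same shape should carry over as pagecounts.flatMap(line => parseLine(line)).reduceByKey(_ + _).collect, since RDDs also have flatMap. I have not run this against the Wikimedia data.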
Upvotes: 2