How to ignore lines with missing fields in the database

Question

I'm following the Spark tutorial using Scala and working with the page-count dataset from Wikimedia. I want to generate a histogram of total page views by language. The first column is the language and the third column is the page-view count. However, some lines in the dataset apparently have no third column, because I get an ArrayIndexOutOfBoundsException when I run the following code.

scala> val tuples = pagecounts.map(line => line.split(" "))
scala> val keyValuePairs = tuples.map(line => (line(0).substring(0, 2), 
  line(2).toInt))
scala> keyValuePairs.reduceByKey(_+_, 1).collect

Does anyone know how to ignore the lines that are missing the third column, so that the query runs only against lines that contain it?

Bob Dalgleish · Accepted Answer

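You want to filter the page counts so that only the lines with at least 3 fields are operated on. Apply filter to the result of the split, after the map, so that it tests the number of fields in each line rather than the length of each individual field:

val tuples = pagecounts.map(line => line.split(" ")).filter(_.length >= 3)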
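If some lines might also carry a non-numeric value in the third column, a slightly more defensive approach is to parse each line into an Option and drop the failures. The following is only a minimal sketch: parseLine, totals, and the sample lines are hypothetical names and data made up for illustration, and it uses plain Scala collections instead of an RDD so that it runs without a Spark installation.

```scala
import scala.util.Try

object PageViewsByLanguage {
  // Hypothetical helper (not part of Spark): returns Some((language, views))
  // for a usable line, and None when the third field is missing or not a number.
  def parseLine(line: String): Option[(String, Int)] = {
    val fields = line.split(" ")
    if (fields.length >= 3 && fields(0).length >= 2)
      Try(fields(2).toInt).toOption.map(views => (fields(0).substring(0, 2), views))
    else
      None
  }

  // Plain-collection stand-in for reduceByKey(_ + _): sum the views per language.
  def totals(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => parseLine(line)) // malformed lines yield None and are dropped
      .groupBy(_._1)
      .map { case (lang, pairs) => lang -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines; the second and the last one are malformed on purpose.
    val sample = Seq(
      "en Main_Page 10 1000",
      "en Foo",
      "fr Accueil 5 500",
      "en Bar 7 700",
      "de Seite x 1"
    )
    println(totals(sample))
  }
}
```

With the RDD from the question, the same helper should fit into something like pagecounts.flatMap(line => parseLine(line)).reduceByKey(_ + _).collect, assuming pagecounts is an RDD[String] as in the tutorial.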
