user2122466

Reputation: 45

Spark Scala unable to parse Wikipedia data: enwiki-latest-pages-articles.xml.bz2

I am trying to do topic modelling on the Wikipedia data using the Spark LDA algorithm. The input file is a large bz2 file containing a lot of XML. I am using the basic Spark Scala code from the Spark website:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

val sc: SparkContext = new SparkContext(conf)
val ssqlc: SQLContext = new SQLContext(sc)
val shsqlc: HiveContext = new HiveContext(sc)

// Load and parse the data
val data = sc.textFile("/user/enwiki-latest-pages-articles.xml.bz2")

// val datanew = data.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }

// Treat every line as a space-separated list of numbers
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into 25 topics using LDA
val ldaModel = new LDA().setK(25).run(corpus)

// Output topics. Each is a distribution over words (matching word count vectors)
println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 25)) {
  print("Topic " + topic + ":")
  for (word <- Range(0, ldaModel.vocabSize)) { print(" " + topics(word, topic)) }
  println()
  // val newtopics = ldaModel.describeTopics(5).foreach(println)
}

It doesn't process the data and throws errors such as:

ERROR executor.Executor: Exception in task 5.0 in stage 0.0 (TID 2)
java.lang.NumberFormatException: empty String
16/07/28 09:24:35 ERROR executor.Executor: Exception in task 10.0 in stage 0.0 (TID 5)
java.lang.NumberFormatException: For input string: "|"
16/07/28 09:24:35 ERROR executor.Executor: Exception in task 7.0 in stage 0.0 (TID 3)
java.lang.NumberFormatException: For input string: "|}"

Can someone please help me with this? A short code example that fixes this would help. Thank you in advance.

Upvotes: 0

Views: 387

Answers (1)

Jean Logeart

Reputation: 53839

Your problem is that your data contains strings that are not numbers. Therefore, this is failing:

s.trim.split(' ').map(_.toDouble)
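The failure can be reproduced without Spark. The snippet below is only a minimal sketch: the object name `ParseDemo`, the helper names `parseStrict` and `parseLenient`, and the sample strings are made up for illustration. It shows the strict parse throwing on a markup token such as `|}` and on an empty line, and a lenient variant that silently drops tokens that are not numbers.

```scala
import scala.util.Try

object ParseDemo {
  // Same expression as in the question: throws NumberFormatException
  // as soon as one token is not a valid number.
  def parseStrict(s: String): Array[Double] =
    s.trim.split(' ').map(_.toDouble)

  // Lenient variant (illustrative): keep only the tokens that parse.
  def parseLenient(s: String): Array[Double] =
    s.trim.split(' ').flatMap(tok => Try(tok.toDouble).toOption)

  def main(args: Array[String]): Unit = {
    // A purely numeric line parses fine.
    println(parseStrict("1 2 6 0").mkString(","))
    // Wiki table markup and empty lines make the strict parse fail.
    println(Try(parseStrict("|}")).isFailure)
    println(Try(parseStrict("")).isFailure)
    // The lenient parse skips the "|" token instead of throwing.
    println(parseLenient("3 | 4").mkString(","))
  }
}
```

Note that on a Wikipedia dump the lenient variant would throw away nearly every token, so it hides the error rather than producing useful input for LDA.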

You need to clean up your data or extract only the numerical fields you are interested in.
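For a Wikipedia dump, "cleaning up" most likely means turning text into term-count vectors first, since the file contains markup and prose rather than rows of numbers. Here is a rough sketch of one way to do that with MLlib's `HashingTF`. It rests on several assumptions that are not in the question: each line is treated as its own document (a real dump spreads one `<page>` over many lines, so a proper XML reader would be better), the `tokenize` helper and the object name `WikiLdaSketch` are hypothetical, and the feature size of 65536 is an arbitrary choice.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object WikiLdaSketch {
  // Hypothetical helper: lower-case the text, split on anything that is
  // not a letter a-z, and drop tokens of three characters or fewer.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.length > 3).toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WikiLdaSketch")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("/user/enwiki-latest-pages-articles.xml.bz2")

    // Assumption: one line = one document. Lines with no usable tokens are dropped.
    val docs: RDD[Seq[String]] = lines.map(tokenize).filter(_.nonEmpty)

    // Hash each token into one of 65536 buckets and count occurrences.
    val hashingTF = new HashingTF(1 << 16)
    val vectors: RDD[Vector] = hashingTF.transform(docs)

    // Index documents with unique IDs, as in the original code.
    val corpus = vectors.zipWithIndex.map(_.swap).cache()
    val ldaModel = new LDA().setK(25).run(corpus)

    // Print the 5 heaviest terms per topic as (bucket index, weight) pairs.
    ldaModel.describeTopics(5).foreach { case (termIds, weights) =>
      println(termIds.zip(weights).mkString(", "))
    }

    sc.stop()
  }
}
```

With hashing, the term indices printed at the end are hash buckets, not words. If readable topics are needed, building an explicit vocabulary (word to index map) instead of `HashingTF` would be the alternative.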

Upvotes: 0
