Regex on io.Text RDD using scala

Question

I have a problem. I need to extract some data from a file like this:

(3269,

Anarchism
0
12
...
)
(194712,

AssistiveTechnology
0
23.. 
) etc...

This file was generated using:

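import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "")
val rdd = sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
rdd.map { case (k, v) => (k.get(), new String(v.copyBytes())) }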

I need to extract the title content. I'm using a regex, but the output file is still empty. My code looks like this:

val xx = rdd.map(x => x._2).filter(x => x.matches(".*([A-Za-z]+)<\/title>.*"))

I also tried this pattern:

".*<title>([A-Za-z]+).*"

And using this:

val reg = ".*([\w]+).*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)

I build the .jar with sbt and run it with spark-submit.

BTW, using spark-shell it works :S

I need your help please. Thanks.
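
For reference, here is a minimal sketch, independent of Spark, of how patterns like the ones above behave on a multi-line record. It assumes each record spans several lines and contains a `<title>…</title>` element. The sample record, its `<page>`, `<ns>` and `<id>` tags, and the names `TitleRegexSketch` and `extractTitle` are made up for illustration and are not from the original job. It shows one possible cause of an empty result, not a confirmed diagnosis: `String.matches` has to match the entire string, and `.` does not match line terminators unless DOTALL (`(?s)`) is enabled.

```scala
object TitleRegexSketch {
  // Hypothetical multi-line record (assumption: the real records look roughly like this).
  val record: String =
    "<page>\n<title>Anarchism</title>\n<ns>0</ns>\n<id>12</id>\n</page>"

  // matches() must consume the whole string; without DOTALL, '.' stops at '\n',
  // so this cannot succeed on a record that spans several lines.
  val wholeMatch: Boolean = record.matches(".*<title>([A-Za-z]+)</title>.*")

  // Same pattern with the (?s) flag, which lets '.' match line terminators too.
  val dotAllMatch: Boolean = record.matches("(?s).*<title>([A-Za-z]+)</title>.*")

  // Unanchored search that returns the captured title instead of only filtering.
  private val titleRe = "<title>([^<]+)</title>".r

  def extractTitle(text: String): Option[String] =
    titleRe.findFirstMatchIn(text).map(_.group(1))

  def main(args: Array[String]): Unit = {
    println(s"matches without (?s): $wholeMatch")
    println(s"matches with (?s):    $dotAllMatch")
    println(s"extracted title:      ${extractTitle(record)}")
  }
}
```

If this assumption holds for the real data, the same function could probably be applied to the string values of the RDD with something like `rdd.map(_._2).flatMap(v => extractTitle(v))`, which keeps only the records that contain a title.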
