Aavik

Reputation: 1037

Is there a better way of converting Iterator[Char] to Seq[String]?

Following is the code I have used to convert an Iterator[Char] to a Seq[String].

val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(11).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_iter = remove_comp.map(_.toChar.toString).toSeq.mkString.split("\n")
val rdd_input = Spark.sparkSession.sparkContext.parallelize(convert_iter)

Contents of the file at fileDir:

12**34567890
12@@34567890
12!!34567890
12¬¬34567890
12
'34567890

I am not happy with this code: the data is large, and converting all of it to a String runs out of heap space. If I stop before the String conversion, I get an Iterator[Char]:

val convert_iter = remove_comp.map(_.toChar)
convert_iter: Iterator[Char] = non-empty iterator

Is there a better way of coding?

Upvotes: 0

Views: 462

Answers (2)

Ramesh Maharjan

Reputation: 41957

Looking at your code, I see that you are trying to replace special characters such as **, @@ and so on in a file that contains the following data:

12**34567890
12@@34567890
12!!34567890
12¬¬34567890
12
'34567890

For that, you can read the data with sparkContext.textFile and apply the regex with replaceAllIn:

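import scala.util.matching.Regex

// Character class listing every special character to strip
val pattern = new Regex("[¬~!@#$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-]")
// textFile reads the input line by line; each line is cleaned independently
val result = sc.textFile(fileDir).map(line => pattern.replaceAllIn(line, ""))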

You should then have your result as an RDD[String], which, like an iterator, is evaluated lazily:

1234567890
1234567890
1234567890
1234567890
12
34567890
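
The effect of the pattern can be checked without a Spark cluster. The following is a minimal sketch, assuming the sample lines from the question are passed in as plain Strings; SpecialCharCleaner and clean are hypothetical names used only for illustration.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of Spark: applies the same character class locally.
object SpecialCharCleaner {
  // Same character class as in the answer: every listed symbol is removed.
  val pattern: Regex =
    new Regex("[¬~!@#$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-]")

  // Returns the line with all listed special characters stripped.
  def clean(line: String): String = pattern.replaceAllIn(line, "")
}
```

For example, SpecialCharCleaner.clean("12**34567890") yields "1234567890". The same function could be passed to map on the RDD returned by textFile.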

Updated

If \n and \r can appear at the 3rd and 4th positions, and every cleaned record is a fixed-length text of 10 digits, then you can use the wholeTextFiles API of sparkContext with the following regex:

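import scala.util.matching.Regex

// Same character class as before, plus \r and \n
val pattern = new Regex("[¬~!@#$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]")
// wholeTextFiles yields (path, content) pairs; strip the characters, then cut the content into 10-character records
val result = sc.wholeTextFiles(fileDir).flatMap { case (_, content) => pattern.replaceAllIn(content, "").grouped(10) }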

You should get the output as

1234567890
1234567890
1234567890
1234567890
1234567890
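
The strip-then-regroup step can be tried on a plain String first. This is a minimal sketch, assuming the file content fits in one String; FixedWidthRecords, records and the width parameter are hypothetical names introduced for illustration.

```scala
import scala.util.matching.Regex

// Hypothetical helper: removes special characters and line breaks,
// then cuts the remaining text into fixed-width records.
object FixedWidthRecords {
  val pattern: Regex =
    new Regex("[¬~!@#$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?` /\\-\r\n]")

  // width = 10 is an assumption taken from the 10-digit records above.
  def records(content: String, width: Int = 10): List[String] =
    pattern.replaceAllIn(content, "").grouped(width).toList
}
```

Because line breaks are removed before regrouping, a record that was split by a stray \n (such as 12 followed by '34567890) is joined back together. Note that wholeTextFiles reads each file as a single record, so this variant suits many small files better than one very large file.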

I hope the answer is helpful

Upvotes: 0

Mateusz Kubuszok

Reputation: 27535

Completely disregarding corner cases such as empty Strings, I would start with something like:

val test = Iterable('s','f','\n','s','d','\n','s','v','y')

val (allButOne, last) = test.foldLeft( (Seq.empty[String], Seq.empty[Char]) ) {
  case ((strings, chars), char) =>
    if (char == '\n')
      (strings :+ chars.mkString, Seq.empty)
    else
      (strings, chars :+ char)
}

val result = allButOne :+ last.mkString

I am sure it could be made more elegant and handle corner cases better (once you define how you want them handled), but I think it is a nice starting point.

But to be honest I am not entirely sure what you want to achieve. I just guessed that you want to group chars divided by \n together and turn them into Strings.
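
Since the question mentions heap pressure, a lazy variant of the same grouping may be worth considering. The following is only a sketch, assuming the goal is to split on \n and that a trailing newline should not produce an empty final String (which differs from the fold above); CharLines and lines are hypothetical names.

```scala
// Hypothetical helper: lazily turns an Iterator[Char] into an Iterator[String],
// splitting on '\n'. Only the line currently being built is held in memory.
object CharLines {
  def lines(chars: Iterator[Char]): Iterator[String] = new Iterator[String] {
    def hasNext: Boolean = chars.hasNext

    def next(): String = {
      if (!chars.hasNext) throw new NoSuchElementException("no more lines")
      val sb = new StringBuilder
      var done = false
      // Consume characters until a newline or the end of the input.
      while (!done && chars.hasNext) {
        val c = chars.next()
        if (c == '\n') done = true
        else sb.append(c)
      }
      sb.toString
    }
  }
}
```

Calling CharLines.lines(test.iterator).toList on the test value above gives the same three Strings as the fold, but the result is an Iterator, so nothing is materialized until it is consumed.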

Upvotes: 1
