Andrew

Reputation: 8758

Read large file into StringBuilder

I've got text files sitting in HDFS, ranging in size from roughly 300 MB to 800 MB each. They are almost valid JSON files, and I am trying to make them valid JSON so I can save them as ORC files.

I am attempting to create a StringBuilder with the needed opening characters, then read the file in line by line, stripping off the newlines and appending each line to the StringBuilder, and finally add the needed closing character.

import org.apache.hadoop.fs.{FileSystem, Path}

// FileSystem handle (in spark-shell, built from the SparkContext's Hadoop configuration)
val fs = FileSystem.get(sc.hadoopConfiguration)

// create a StringBuilder seeded with the opening characters
val sb = new StringBuilder("{\"data\" : ")

// open the file
val path = new Path("/path/to/crappy/file.json")
val stream = fs.open(path)

// read the file line by line; readLine strips off the newlines so each line
// can be appended to the StringBuilder directly
def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))
readLines.takeWhile(_ != null).foreach(line => sb.append(line))

That works. But as soon as I try to append the closing }:

sb.append("}")

It crashes with an out-of-memory error:

java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3332)
  at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
  at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
  at java.lang.StringBuilder.append(StringBuilder.java:136)
  at scala.collection.mutable.StringBuilder.append(StringBuilder.scala
...

I've tried setting the initial size of the StringBuilder to be larger than the file I'm currently testing with, but that didn't help. I've also tried giving the driver more memory (spark-shell --driver-memory 3g), but that didn't help either.

Is there a better way to do this?

Upvotes: 0

Views: 134

Answers (2)

mazaneicha

Reputation: 9425

If that's all you need, you can do it without Scala via the HDFS command line:

hadoop fs -cat /hdfs/path/prefix /hdfs/path/badjson /hdfs/path/suffix | hadoop fs -put - /hdfs/path/properjson

where the file prefix contains just {"data" : and suffix contains a single }.

Upvotes: 1

simpadjo

Reputation: 4017

1) Don't use Scala's Stream. It is a broken abstraction: it memoizes every element it produces, which makes it extremely difficult to process an infinite or huge stream without blowing up the heap. Stick with a plain old Iterator, or use more principled approaches from fs2 / zio.

In your case the readLines Stream retains every line it has already read, even though you only need to hold one line at a time.
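
For the reading side, a drop-in Iterator replacement might look like this (a sketch reusing the stream and sb from the question; it fixes only the Stream leak, and point 2 below deals with sb):

// Iterator produces lines lazily and never memoizes them, so each
// line becomes eligible for garbage collection once it is consumed.
Iterator.continually(stream.readLine).takeWhile(_ != null).foreach(line => sb.append(line))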

2) The sb object leaks as well: it accumulates the entire file content in memory, and each time its backing array fills up, StringBuilder reallocates a larger one via Arrays.copyOf, which is exactly the allocation failing in your stack trace. Consider writing the corrected content directly into some OutputStreamWriter instead.
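
A minimal sketch of that approach, assuming fs is the same FileSystem handle as in the question; the output path /path/to/fixed/file.json is made up for illustration:

import java.io.{BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter}
import org.apache.hadoop.fs.Path

val in = new BufferedReader(new InputStreamReader(
  fs.open(new Path("/path/to/crappy/file.json")), "UTF-8"))
val out = new BufferedWriter(new OutputStreamWriter(
  fs.create(new Path("/path/to/fixed/file.json")), "UTF-8"))
try {
  out.write("{\"data\" : ")
  // BufferedReader.readLine already strips the newline; the Iterator
  // holds one line at a time, so nothing accumulates on the heap.
  Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(line => out.write(line))
  out.write("}")
} finally {
  out.close()
  in.close()
}

This writes straight back to HDFS, so the heap only ever holds a single line regardless of file size.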

Upvotes: 0
