Reputation: 8758
I've got text files sitting in HDFS, ranging in size from around 300-800 MB each. They are almost valid JSON files. I am trying to make them valid JSON files so I can save them as ORC files.
I am attempting to create a StringBuilder with the needed opening characters, then read the file in line by line, stripping off the newlines, append each line to the string builder, and then add the needed closing character.
import org.apache.hadoop.fs.{FileSystem,Path, PathFilter, RemoteIterator}
import scala.collection.mutable.StringBuilder
//create stringbuilder
var sb = new scala.collection.mutable.StringBuilder("{\"data\" : ")
//read in the file
val path = new Path("/path/to/crappy/file.json")
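//get a handle on the filesystem (assumes spark-shell, where sc is the SparkContext)
val fs = FileSystem.get(sc.hadoopConfiguration)
val stream = fs.open(path)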
//read the file line by line. This will strip off the newlines so we can append it to the string builder
def readLines = Stream.cons(stream.readLine, Stream.continually( stream.readLine))
readLines.takeWhile(_ != null).foreach(line => sb.append(line))
That works. But as soon as I try to append the closing }:
sb.append("}")
It crashes with out of memory:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala
...
I've tried setting the initial size of the StringBuilder to be larger than the file I'm currently testing with, but that didn't help. I've also tried giving the driver more memory (spark-shell --driver-memory 3g), but that didn't help either.
Is there a better way to do this?
Upvotes: 0
Views: 134
Reputation: 9425
If that's all you need, you can just do it without Scala via hdfs command-line:
hadoop fs -cat /hdfs/path/prefix /hdfs/path/badjson /hdfs/path/suffix | hadoop fs -put - /hdfs/path/properjson
where the file prefix just contains {"data" : and the file suffix contains a single }.
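If shelling out is not an option, the same concatenation idea can be sketched in Scala with java.io.SequenceInputStream. This is only a rough equivalent of the cat pipeline above, not something from the original answer: the object name ConcatJson and the method concat are made up for illustration. Like cat, it copies bytes through a fixed-size buffer and keeps the newlines, which should be harmless as long as they fall between JSON tokens rather than inside string values.

```scala
import java.io.{ByteArrayInputStream, InputStream, OutputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets

// Hypothetical helper: streams prefix + body + suffix into `out`
// without ever holding the whole body in memory.
object ConcatJson {
  def concat(body: InputStream, out: OutputStream,
             prefix: String = "{\"data\" : ", suffix: String = "}"): Unit = {
    def bytes(s: String): InputStream =
      new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8))

    // Chain the three sources so they read as one continuous stream.
    val in = new SequenceInputStream(
      new SequenceInputStream(bytes(prefix), body), bytes(suffix))
    val buf = new Array[Byte](64 * 1024) // fixed-size copy buffer

    try {
      var n = in.read(buf)
      while (n != -1) {
        out.write(buf, 0, n)
        n = in.read(buf)
      }
    } finally {
      in.close()  // also closes any underlying streams not yet consumed
      out.close()
    }
  }
}
```

On HDFS, body would presumably be fs.open(path) and out would be fs.create(...) with an output path of your choosing.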
Upvotes: 1
Reputation: 4017
1) Don't use Scala's Stream. It is just a broken abstraction. It's extremely difficult to use an infinite or huge stream without blowing up the heap. Stick either with a plain old Iterator or use more principled approaches from fs2 / zio.
In your case the readLines object accumulates all entries even though you expect it to hold only one at a time.
2) The sb object leaks as well. It accumulates the entire file content in memory.
Consider writing the corrected content directly into some OutputStreamWriter.
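Both points can be combined in a minimal sketch: read lines through an Iterator and write each one straight to an OutputStreamWriter. The object name JsonWrapper and the method wrap are invented for illustration, and the code uses plain java.io streams so it can be tried without a cluster. It still holds one line in memory at a time, so it would not help if the whole file were a single line.

```scala
import java.io.{BufferedReader, BufferedWriter, InputStream, InputStreamReader, OutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets

// Hypothetical helper: copies `in` to `out` line by line, dropping the
// newlines and wrapping the content in prefix ... suffix.
object JsonWrapper {
  def wrap(in: InputStream, out: OutputStream,
           prefix: String = "{\"data\" : ", suffix: String = "}"): Unit = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
    try {
      writer.write(prefix)
      // Unlike Stream, an Iterator does not memoize the elements it has produced,
      // so each line becomes garbage as soon as it has been written.
      Iterator.continually(reader.readLine())
        .takeWhile(_ != null)
        .foreach(line => writer.write(line))
      writer.write(suffix)
    } finally {
      writer.close() // flushes buffered output and closes `out`
      reader.close()
    }
  }
}
```

With the fs handle from the question, the call would presumably look like JsonWrapper.wrap(fs.open(path), fs.create(outPath)), where outPath is a Path you pick for the corrected file.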
Upvotes: 0