Reputation: 362
I get an OutOfMemoryError when processing tar.gz files larger than 1 GB in Spark.
To get past this error I tried splitting each tar.gz into multiple parts with the 'split' command, only to find out that each split part is not a valid tar.gz on its own and so cannot be processed as one.
# Split archives larger than ~500 MB into 500 MB chunks, then move the original aside.
dir=/dbfs/mnt/data/temp
threshold=524288000   # 500 MB in bytes

for file in /dbfs/mnt/data/*.tar.gz; do
    size=$(stat -c%s "$file")
    if [[ "$size" -gt "$threshold" ]]; then
        # Byte-wise split: the resulting parts are NOT valid tar.gz files on their own.
        split -b 500M -d --additional-suffix=.tar.gz "$file" "${file%%.*}_part"
        mv "$file" "$dir"
    fi
done
Error when trying to process the split files:
Caused by: java.io.EOFException
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:281)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.read(TarArchiveInputStream.java:590)
at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:98)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.Reader.read(Reader.java:140)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2001)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1980)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1957)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1907)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:778)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:803)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$$anonfun$apply$1.apply(command-2152765781429277:33)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$$anonfun$apply$1.apply(command-2152765781429277:31)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
at scala.collection.immutable.Stream.foreach(Stream.scala:595)
at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:316)
at scala.collection.AbstractTraversable.toMap(Traversable.scala:104)
at linea3796c25fa964697ba042965141ff28825.$read$$iw$$iw$$iw$$iw$$iw$$iw$Unpacker$.apply(command-2152765781429277:34)
at linea3796c25fa964697ba042965141ff28827.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-2152765781429278:3)
at linea3796c25fa964697ba042965141ff28827.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-2152765781429278:3)
I have tar.gz files that go up to 4 GB in size, and each can contain up to 7000 JSON documents whose sizes vary from 1 MB to 50 MB.
If I want to divide the large tar.gz files into smaller tar.gz files, is my only option to decompress them and then recompress subsets of the contents based on file size or file count? Is this the way?
Upvotes: 0
Views: 1704
Reputation: 1126
Plain gzip files are not splittable, and gzipped tar archives are even harder to deal with. Spark can read gzipped JSON files directly, but it cannot read tar archives, gzipped or not. Spark can handle binary files up to about 2 GB each, and it can handle JSON documents that have been concatenated together.
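For the gzipped-JSON case, a minimal PySpark sketch (assuming a Databricks notebook where spark is already defined; the path of plain .json.gz files, not tar archives, is hypothetical):

# Sketch only: Spark decompresses gzip transparently per file, so plain
# gzipped JSON can be read directly.
# This does NOT work for .tar.gz archives, which Spark cannot unpack.
df = spark.read.json("/dbfs/mnt/data/json/*.json.gz")   # hypothetical path
df.printSchema()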
I would recommend using a Pandas UDF or a .pipe() operator to process each tar-gzipped file (one archive per worker). Each worker would gunzip, untar, and process each JSON document in a streaming fashion, never filling memory. Hopefully you have enough source files to run this in parallel and see a speed-up. A rough sketch of the idea follows below.
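This is only a sketch of the per-worker streaming approach, not a drop-in solution: the archive paths are hypothetical, the per-document processing is a placeholder, and it assumes Spark 3.x with pandas and PyArrow available on the workers.

import json
import tarfile

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

@pandas_udf(LongType())
def process_archive(paths: pd.Series) -> pd.Series:
    def handle_one(path: str) -> int:
        count = 0
        # "r|gz" streams the archive sequentially, so only one member
        # is held in memory at a time.
        with tarfile.open(path, mode="r|gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                fileobj = tar.extractfile(member)
                if fileobj is None:
                    continue
                doc = json.load(fileobj)   # one JSON document at a time
                # ... process `doc` here (placeholder) ...
                count += 1
        return count
    return paths.apply(handle_one)

# One archive path per row; Spark spreads the rows across workers,
# so each worker unpacks its own archives.
archives = spark.createDataFrame(
    [("/dbfs/mnt/data/archive_1.tar.gz",), ("/dbfs/mnt/data/archive_2.tar.gz",)],
    ["path"],
)
archives.select(process_archive("path").alias("n_docs")).show()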
You might also want to explore streaming approaches: deliver your compressed JSON files incrementally to ADLS Gen2 / S3 and use Databricks Auto Loader to load and process the files as soon as they arrive.
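A minimal Auto Loader sketch, assuming the producer can land individual JSON files (optionally gzipped, but not tar archives) in a cloud directory; the landing path, checkpoint location, and table name below are hypothetical:

# Databricks Auto Loader: pick up new JSON files as they arrive.
incoming = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("abfss://landing@myaccount.dfs.core.windows.net/json/")
)

(
    incoming.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/json_ingest")
    .toTable("raw_json_docs")
)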
The answers to this question, How to load tar.gz files in streaming datasets?, also look promising.
Upvotes: 3