Reputation: 1659
It is generally said that a non-splittable compression format like Gzip, when used together with a container file format such as Avro or SequenceFile, will make the resulting file splittable.
Does this mean that the blocks inside the container format get compressed with the chosen codec (like Gzip), or does it work some other way? Can someone please explain this? Thanks!
Well, I think the question requires an update.
Update:
Is there a straightforward way to convert a large file compressed with a non-splittable codec (like Gzip) into a splittable file (using a container file format such as Avro, SequenceFile or Parquet) so it can be processed by MapReduce?
Note: I am not asking for workarounds such as decompressing the file and then re-compressing the data with a splittable compression codec.
Upvotes: 3
Views: 1133
Reputation: 1
I don't know exactly what you are talking about... but any file can be split at any point.
Why do I say this? I am assuming you are using something like Linux or similar.
On Linux it is (relatively) easy to create a block device that is actually stored on the concatenation of several files.
I mean:
That way you can have a BIG file (more than 16GiB, more than 4GiB) stored on one FAT32 partition (which has a limit of 4GiB-1 bytes per file)... and access it on the fly and transparently... considering read-only access.
For read/write access there is a trick (that is the complex part) that works:
On Linux I have used in the past some command-line tools that do all the work; they let you create a virtual container, resizable on the fly, that uses files of an exact size (including the last one) and exposes it as a regular block device (where you can do a dd if=... of=... to fill it) with a virtual file associated with it.
That way you have:
Maybe that gives you an idea for another approach to the problem you are having:
Such tools exist; I do not remember their names, sorry! But I remember the one for read-only access (the dvd_double_layer.* files are on a FAT32 partition):
# cd /mnt/FAT32
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 3.5G 2017-04-20 13:10 dvd_double_layer.000
-r--r--r-- 1 root root 3.5G 2017-04-20 13:11 dvd_double_layer.001
-r--r--r-- 1 root root 0.2G 2017-04-20 13:12 dvd_double_layer.002
# affuse dvd_double_layer.000 /mnt/transparent_concatenated_on_the_fly
# cd /mnt/transparent_concatenated_on_the_fly
# ln -s dvd_double_layer.000.raw dvd_double_layer.iso
# ls -lh dvd_double_layer.*
total #
-r--r--r-- 1 root root 7.2G 2017-04-20 13:13 dvd_double_layer.000.raw
-r--r--r-- 1 root root 7.2G 2017-04-20 13:14 dvd_double_layer.iso
Hope this idea can help you.
Upvotes: 0
Reputation: 13927
For SequenceFiles, if you specify BLOCK compression, each block will be compressed using the specified compression codec. Blocks allow Hadoop to split the data at the block level while using compression (where the compression codec itself isn't splittable) and to skip whole blocks without needing to decompress them.
Most of this is described on the Hadoop wiki: https://wiki.apache.org/hadoop/SequenceFile
Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
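To make this concrete, here is a minimal sketch (not from the answer itself) of writing a block-compressed SequenceFile with the Hadoop Java API; the output path, class name and sample records are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class WriteBlockCompressedSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/demo.seq"); // hypothetical output path
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // BLOCK compression: keys and values are collected into blocks
                // and each block is compressed as a unit with the Gzip codec.
                SequenceFile.Writer.compression(CompressionType.BLOCK, codec));
        try {
            for (long i = 0; i < 1000; i++) {
                writer.append(new LongWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}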
For Avro it is all very similar: https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files
Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.
Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents.
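As an illustration only (not part of the original answer), here is a sketch of writing an Avro object container file with a compressed codec using the Avro Java API; the record schema and output path are assumptions:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteAvroContainerFile {
    public static void main(String[] args) throws Exception {
        // Hypothetical record schema with a single string field.
        Schema schema = SchemaBuilder.record("Line").fields()
                .requiredString("text").endRecord();

        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        // Each block of the container file is compressed with deflate;
        // sync markers between blocks keep the file splittable.
        writer.setCodec(CodecFactory.deflateCodec(6));
        writer.create(schema, new File("/tmp/demo.avro")); // hypothetical output path

        for (int i = 0; i < 1000; i++) {
            GenericRecord r = new GenericData.Record(schema);
            r.put("text", "record-" + i);
            writer.append(r);
        }
        writer.close();
    }
}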
The easiest (and usually fastest) way to convert data from one format into another is to let MapReduce do the work for you. In the example of:
GZip Text -> SequenceFile
You would have a map-only job that uses TextInputFormat for input and SequenceFileOutputFormat for output. This way you get a 1-to-1 conversion on the number of files (add a reduce step if this needs changing), and the conversion runs in parallel if there are lots of files to convert.
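For example, a minimal driver for such a map-only conversion job might look like the sketch below; the class name, paths and the choice of output codec are my own assumptions, not something prescribed by the answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class GzipTextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gzip-text-to-sequencefile");
        job.setJarByClass(GzipTextToSequenceFile.class);

        // Map-only job: the base Mapper class is the identity mapper,
        // passing the (offset, line) pairs from TextInputFormat straight through.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class); // reads the .gz files (one non-split map per file)
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Block-compress the output SequenceFiles so they stay splittable.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}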
Upvotes: 1