Reputation: 11629
I can see that pig can read .bz2 files natively but I am not sure whether it runs an explicit job to split bz2 into multiple inputsplits? Can anyone confirm this? If pig is running a job to create inputsplits, is there a way to avoid that? I mean a way to have MapReduce framework split bz2 files into muplitple inputslits in the framework level?
Upvotes: 2
Views: 1063
Reputation: 30089
Splittable input formats are not implemented in hadoop (or in pig, which just runs MR jobs for you) such that a file is split by one job, then the splits processed by a second job.
The input format defines an isSplittable
method which defines whether in principal the file format can be split. In addition to this, most text based formats will check to see whether the file is using a known compression codec (for example: gzip, bzip2) and if the codec support splits (gzip doesn't, in principal, but bz2 does).
If the input format / codec does allow for splitting of the files, then splits are defined at defined (and configurable) points in the compressed file (say every 64 MB). When the map tasks are created to process each split, then get the input format to create a record reader for the file, passing the split information for where the reader should start from (the 64MB block offset). The reader is then told to seek to the offset point of the split. At this point the underlying codec will seek to that point in the compressed file, and scan forward until it finds the next compressed block header (in the case of bz2). Reads then continue as normal on the uncompressed stream returned from the codec, until the split end point has been passed over in the uncompressed stream.
Upvotes: 1