abcd efg

Reputation: 71

How to do random access of specific blocks in an .xz file?

My goal is to be able to reduce time needed to look at specific sections from the middle of very large log files compressed to .xz format.

If an .xz file is, for example, 6 GB compressed and 60 GB uncompressed, even a simple command like xzcat <file> | tail -1, which only shows the last line of the uncompressed file, takes many minutes because the entire file has to be decompressed first.


From reading https://stackoverflow.com/a/34053829/12132601, my understanding is that .xz files are organised into blocks and it is possible to decompress specific blocks, if you can find the right starting position and length of the file to take. However I could not follow this:

You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:

SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1

Specifically the part about the overhead of 36 and how he got it.

plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ)

I've been reading https://tukaani.org/xz/xz-file-format.txt but I could not follow a lot of it. I did not find out where the 36 came from.

36 definitely did NOT work with my file. I actually tried every value from 1 to 100 and none worked.


The first 3 lines of my file look like this with hd:

00000000  fd 37 7a 58 5a 00 00 04  e6 d6 b4 46 04 c0 e2 c3  |.7zXZ......F....|
00000010  39 80 80 80 08 21 01 14  00 00 00 00 3e 0b 39 68  |9....!......>.9h|
00000020  e9 e2 3f f0 00 5d 00 18  8d 82 f9 18 7b b2 75 c6  |..?..]......{.u.|

And the first few lines of xz -lvv <myxzfile> look like this:

<myxzfile> (1/1)
  Streams:            1
  Blocks:             4,080
  Compressed size:    5,789.9 MiB (6,071,150,860 B)
  Uncompressed size:  63.7 GiB (68,443,750,160 B)
  Ratio:              0.089
  Check:              CRC64
  Stream padding:     0 B
  Streams:
    Stream    Blocks      CompOffset    UncompOffset        CompSize      UncompSize  Ratio  Check      Padding
         1     4,080               0               0   6,071,150,860  68,443,750,160  0.089  CRC64            0
  Blocks:
    Stream     Block      CompOffset    UncompOffset       TotalSize      UncompSize  Ratio  Check      CheckVal          Header  Flags        CompSize    MemUsage  Filters
         1         1              12               0         942,592      16,777,216  0.056  CRC64      e77988a5264b499e      20  cu            942,562       5 MiB  --lzma2=dict=4MiB
         1         2         942,604      16,777,216         887,748      16,777,216  0.053  CRC64      b1124241f57be325      20  cu            887,718       5 MiB  --lzma2=dict=4MiB
         1         3       1,830,352      33,554,432         836,008      16,777,216  0.050  CRC64      0b9ed8b7bd1be895      20  cu            835,978       5 MiB  --lzma2=dict=4MiB
         1         4       2,666,360      50,331,648         893,172      16,777,216  0.053  CRC64      4399327c125c6a13      20  cu            893,144       5 MiB  --lzma2=dict=4MiB
         1         5       3,559,532      67,108,864         757,964      16,777,216  0.045  CRC64      908e32d2276f5b4b      20  cu            757,933       5 MiB  --lzma2=dict=4MiB

If I want to decompress just the 3rd block, naively I would think head -c 2666360 2022-06-16T00:00:00.xz | tail -c 836008 | unxz -c would work but of course it doesn't. What is the starting position and length of the file I should be taking, and why?

Upvotes: 2

Views: 1075

Answers (2)

rogdham

Reputation: 238

When decompressing a file, the unxz (or xz -d) command by default tries to auto-detect the file format (equivalent to --format=auto). This works with xz files, but it needs the xz stream header at the beginning.

But if you cut an xz file to take only one block, it is no longer a valid xz file, because it is missing the xz stream header (the very first 12 bytes of the file, assuming it consists of a single stream), the xz index and the xz stream footer.

If, however, you take the first 12 bytes of the xz file and then append the bytes of one block, you are still missing the xz index and xz stream footer, and decompression tools may or may not support that (*). The unxz command seems to support it quite well, so that's one way to do it!

(*) There are two main ways to read an xz file:

  1. In stream mode: decompressing xz blocks as they come (this is what xz is doing), the drawback is that you cannot do random access
  2. If you have the whole file: getting info about the blocks from the xz index & xz stream footer, and reading from the right block
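
As a rough illustration of the second approach, the block table that xz -lvv prints can be rebuilt from the xz stream footer and xz index alone. The sketch below follows the layout described in xz-file-format.txt and assumes a single stream with no stream padding (as in the file from the question). The names read_varint and list_blocks are made up for this example and are not part of any library.

```python
import io
import struct
import zlib


def read_varint(buf, pos):
    """Decode one xz multibyte integer (7 bits per byte, low bits first)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def list_blocks(f):
    """Return (comp_offset, total_size, uncomp_offset, uncomp_size) per block.

    Assumption: f is a seekable binary file holding exactly one xz stream
    with no stream padding after it.
    """
    f.seek(-12, io.SEEK_END)
    footer = f.read(12)
    if footer[10:12] != b"YZ":
        raise ValueError("no xz stream footer at the end of the file")
    # Footer: CRC32 (4) | Backward Size (4) | Stream Flags (2) | "YZ" (2)
    (stored,) = struct.unpack("<I", footer[4:8])
    index_size = (stored + 1) * 4
    f.seek(-12 - index_size, io.SEEK_END)
    index = f.read(index_size)
    if index[0] != 0x00:
        raise ValueError("index indicator byte not found")
    if zlib.crc32(index[:-4]) != struct.unpack("<I", index[-4:])[0]:
        raise ValueError("index CRC32 mismatch")

    count, pos = read_varint(index, 1)
    blocks = []
    # The first block follows the 12-byte stream header.
    comp_offset, uncomp_offset = 12, 0
    for _ in range(count):
        unpadded_size, pos = read_varint(index, pos)
        uncomp_size, pos = read_varint(index, pos)
        total_size = (unpadded_size + 3) & ~3  # blocks are padded to 4 bytes
        blocks.append((comp_offset, total_size, uncomp_offset, uncomp_size))
        comp_offset += total_size
        uncomp_offset += uncomp_size
    return blocks
```

Called on the file from the question (opened in binary mode), the third entry should line up with the xz -lvv row for block 3: compressed offset 1,830,352, total size 836,008, uncompressed offset 33,554,432.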

An alternative would be to further cut the data of an xz block to get only the compressed data, and uncompress it with unxz -F raw. However, this has two drawbacks:

  • You need to further look into the block to be able to know how many bytes to cut out
  • You would need to pass the right filters as command flags

So I would say this is impractical to do manually.
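
To give an idea of the bookkeeping involved, take block 3 from the xz -lvv listing in the question. The block starts at offset 1,830,352 and its header is 20 bytes, so the compressed data starts at 1,830,372 and runs for 835,978 bytes (CompSize). Adding the 20-byte header and the 8-byte CRC64 gives 836,006, which is padded up to the 836,008 shown as TotalSize. The following is only a sketch of the same idea with Python's lzma module, assuming the block uses a single LZMA2 filter. decode_block_raw is a hypothetical helper name, and the offset, size and dictionary size have to be supplied from the listing.

```python
import lzma


def decode_block_raw(f, block_offset, comp_size, dict_size):
    """Decode the compressed data of one xz block as a raw LZMA2 stream.

    Assumptions: f is a seekable binary file, the block's only filter is
    LZMA2, and comp_size / dict_size come from `xz -lvv` (CompSize, Filters).
    """
    f.seek(block_offset)
    # The first byte of a block header encodes its size: (byte + 1) * 4.
    header_size = (f.read(1)[0] + 1) * 4
    f.seek(block_offset + header_size)
    payload = f.read(comp_size)
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
    return decoder.decompress(payload)
```

For the file in the question this would be called as decode_block_raw(f, 1_830_352, 835_978, 4 * 1024 * 1024). Note that the block's CRC64 is not verified on this path, since the check field is never handed to the decoder.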


From reading https://stackoverflow.com/a/34053829/12132601

In that post, instead of having one stream with several blocks in the xz file, they are creating (then reading) an xz file composed of several streams having one block each.

In that case, taking the last stream with tail -c would be enough because a file composed of one stream is a valid xz file.
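
This also suggests where the 36 bytes in that post may come from: a stream that holds a single block consists of the block plus a 12-byte stream header, the index, and a 12-byte stream footer, and for a one-block stream the index is often 12 bytes as well. The index length depends on the block sizes, so 36 is not a constant. A minimal sketch, assuming a file made of concatenated one-block streams (built in memory here with Python's lzma module; stream_overhead is a name invented for this example):

```python
import lzma
import struct


def stream_overhead(data):
    """Size of stream header + index + stream footer of the last stream.

    Assumption: data ends with a complete xz stream and no stream padding.
    """
    footer = data[-12:]
    if footer[10:12] != b"YZ":
        raise ValueError("data does not end with an xz stream footer")
    # Backward Size is stored as (index_size / 4) - 1, little-endian.
    (stored,) = struct.unpack("<I", footer[4:8])
    index_size = (stored + 1) * 4
    return 12 + index_size + 12


# Two chunks compressed separately and concatenated:
# two streams with one block each.
first = lzma.compress(b"first chunk\n" * 1000)
last = lzma.compress(b"last chunk\n" * 1000)
multi = first + last

overhead = stream_overhead(multi)
# Stand-in for the block's TotalSize as printed by `xz --verbose --list`.
block_total = len(last) - overhead
# Same idea as `tail -c $SIZE file.xz | unxz -c | tail -n1`.
tail = multi[-(block_total + overhead):]
last_line = lzma.decompress(tail).splitlines()[-1]
print(overhead, last_line)
```

With a single stream holding 4,080 blocks, as in the question, the same tail would start in the middle of the stream without a stream header, which is why no fixed overhead value works there.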


I did not find any way to easily do random access in an xz file using the xz command line (nor pixz for that matter).

If by chance you are using Python, I take this opportunity to highlight the python-xz library I wrote as a drop-in replacement to lzma to perform random-access transparently on xz files.

In your case, something like this:

import xz

with xz.open('2022-06-16T00:00:00.xz') as f:
    f.seek(33_554_432)  # position is decompressed offset
    print(f.read(0x1000000))

Upvotes: 0

abcd efg

Reputation: 71

It appears I just need to prepend the first 12 bytes of the file (the stream header) to the block I need, i.e.:

cat <(head -c 12 2022-06-16T00:00:00.xz) <(head -c 2666360 2022-06-16T00:00:00.xz | tail -c 836008) | unxz -c
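
The same trick can be written with Python's standard lzma module, seeking straight to the block instead of piping the start of the file through head and tail. This is a sketch rather than a finished tool: read_block is a made-up helper, the offset and size are the CompOffset and TotalSize columns of xz -lvv, and it relies on LZMADecompressor handing back whatever it has decoded when the input stops before the index and stream footer.

```python
import lzma


def read_block(f, comp_offset, total_size):
    """Decompress one block by feeding stream header + block to the decoder.

    Assumption: f is a seekable binary file containing a single xz stream.
    """
    f.seek(0)
    stream_header = f.read(12)
    f.seek(comp_offset)
    block = f.read(total_size)
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    # The index and stream footer are never supplied, so decoder.eof stays
    # False after this call; only the block itself is decoded.
    return decoder.decompress(stream_header + block)
```

For block 3 of the file above, that would be read_block(f, 1_830_352, 836_008) with f opened in binary mode.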

Upvotes: 0
