Reputation: 71
My goal is to reduce the time needed to look at specific sections in the middle of very large log files compressed in .xz format.
If an .xz file is, for example, 6 GB compressed and 60 GB uncompressed, even a simple command like xzcat <file> | tail -1, which only shows the last line of the uncompressed file, makes you wait many minutes while the entire file is decompressed.
From reading https://stackoverflow.com/a/34053829/12132601, my understanding is that .xz files are organised into blocks and it is possible to decompress specific blocks, if you can find the right starting position and length of the file to take. However I could not follow this:
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Specifically, I don't understand the 36 bytes of overhead and how the author arrived at that number:
plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ)
I've been reading https://tukaani.org/xz/xz-file-format.txt but I could not follow a lot of it. I did not find out where the 36 came from.
36 definitely did NOT work with my file. I actually tried every value from 1 to 100 and none worked.
The first 3 lines of my file look like this with hd:
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 04 c0 e2 c3 |.7zXZ......F....|
00000010 39 80 80 80 08 21 01 14 00 00 00 00 3e 0b 39 68 |9....!......>.9h|
00000020 e9 e2 3f f0 00 5d 00 18 8d 82 f9 18 7b b2 75 c6 |..?..]......{.u.|
And the first few lines of xz -lvv <myxzfile> look like this:
<myxzfile> (1/1)
Streams: 1
Blocks: 4,080
Compressed size: 5,789.9 MiB (6,071,150,860 B)
Uncompressed size: 63.7 GiB (68,443,750,160 B)
Ratio: 0.089
Check: CRC64
Stream padding: 0 B
Streams:
Stream  Blocks  CompOffset  UncompOffset       CompSize      UncompSize  Ratio  Check  Padding
     1   4,080           0             0  6,071,150,860  68,443,750,160  0.089  CRC64        0
Blocks:
Stream  Block  CompOffset  UncompOffset  TotalSize  UncompSize  Ratio  Check  CheckVal          Header  Flags  CompSize  MemUsage  Filters
     1      1          12             0    942,592  16,777,216  0.056  CRC64  e77988a5264b499e      20  cu      942,562     5 MiB  --lzma2=dict=4MiB
     1      2     942,604    16,777,216    887,748  16,777,216  0.053  CRC64  b1124241f57be325      20  cu      887,718     5 MiB  --lzma2=dict=4MiB
     1      3   1,830,352    33,554,432    836,008  16,777,216  0.050  CRC64  0b9ed8b7bd1be895      20  cu      835,978     5 MiB  --lzma2=dict=4MiB
     1      4   2,666,360    50,331,648    893,172  16,777,216  0.053  CRC64  4399327c125c6a13      20  cu      893,144     5 MiB  --lzma2=dict=4MiB
     1      5   3,559,532    67,108,864    757,964  16,777,216  0.045  CRC64  908e32d2276f5b4b      20  cu      757,933     5 MiB  --lzma2=dict=4MiB
If I want to decompress just the 3rd block, I would naively expect head -c 2666360 2022-06-16T00:00:00.xz | tail -c 836008 | unxz -c to work, but of course it doesn't. Which starting position and length should I take from the file, and why?
Upvotes: 2
Views: 1075
Reputation: 238
When decompressing a file, the unxz (or xz -d) command by default tries to auto-detect the type of the archive (equivalent to --format=auto). This works with xz files, but it needs the xz stream header at the beginning.
But if you cut an xz file to take only one block, it is not a valid xz file anymore: it is missing the xz stream header (the very first 12 bytes of the xz file, assuming your xz file is made of only one stream), the xz index and the xz stream footer.
If, however, you take the first 12 bytes of the xz file and then append the bytes of one block, you are still missing the xz index and xz stream footer, and decompression tools may or may not support that (*). It seems that the unxz command supports it quite well, so that's one way to do it!
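As a rough illustration of the header-plus-block approach, the sketch below does the same thing with Python's standard lzma module. The helper name read_block and its parameters are made up for this example; comp_offset and total_size are meant to be the CompOffset and TotalSize columns printed by xz -lvv. It assumes a single-stream file and relies on the decoder tolerating the missing index and footer, which is observed behaviour rather than something the format promises.

```python
import lzma

STREAM_HEADER_SIZE = 12  # magic (6 bytes) + stream flags (2) + CRC32 (4)


def read_block(f, comp_offset, total_size):
    """Decompress one block of a single-stream .xz file.

    f is a seekable binary file object; comp_offset and total_size are
    the CompOffset and TotalSize values reported by `xz -lvv` (hypothetical
    parameter names chosen for this sketch).
    """
    f.seek(0)
    header = f.read(STREAM_HEADER_SIZE)
    f.seek(comp_offset)
    block = f.read(total_size)
    # The stream header tells the decoder which integrity check to expect.
    # After the block is decoded, the decoder just waits for more input,
    # so the absent index and footer do not raise an error here.
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    return decoder.decompress(header + block)


# Usage for block 3 of the listing in the question (not executed here):
# with open("2022-06-16T00:00:00.xz", "rb") as f:
#     data = read_block(f, 1_830_352, 836_008)
```

Reading with seek and read avoids piping the first 2.6 MB of the file through head just to discard most of it, which matters more for blocks near the end of a 6 GB file.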
(*) There are two main ways to read an xz file:
xz is doing), the drawback is that you cannot do random access
An alternative would be to further cut the data of an xz block to get only the compressed data, and uncompress it with unxz -F raw. However, this has two drawbacks:
So I would say this is impractical to do manually.
From reading https://stackoverflow.com/a/34053829/12132601 …
In that post, instead of having one stream with several blocks in the xz file, they are creating (then reading) an xz file composed of several streams having one block each.
In that case, taking the last stream with tail -c would be enough, because a file composed of one stream is a valid xz file.
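To make the multi-stream case concrete, here is a small sketch using Python's standard lzma module. The sample data and variable names are invented for illustration; it only demonstrates that the tail of a multi-stream file holding exactly one stream decompresses on its own.

```python
import lzma

# Three chunks, each compressed into its own complete xz stream
# (header, one block, index, footer).
chunks = [b"first part\n" * 1000, b"second part\n" * 1000, b"last line\n"]
streams = [lzma.compress(chunk) for chunk in chunks]

# Concatenated streams form a valid multi-stream .xz file.
archive = b"".join(streams)
whole = lzma.decompress(archive)

# The trailing bytes that make up the last stream are a valid .xz file
# by themselves, which is what `tail -c SIZE file.xz | unxz -c` relies on.
last_stream = archive[-len(streams[-1]):]
print(lzma.decompress(last_stream))
```

Here the size of the last stream is known because the sketch created it; with an existing file it has to come from xz --list.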
I did not find any way to easily do random access in an xz file using the xz command line (nor pixz for that matter).
If by chance you are using Python, I take this opportunity to highlight the python-xz library I wrote as a drop-in replacement for lzma to perform random access transparently on xz files.
In your case, something like this:
import xz

with xz.open('2022-06-16T00:00:00.xz') as f:
    f.seek(33_554_432)  # position is the decompressed offset (start of block 3)
    print(f.read(0x1000000))  # 16 MiB, the uncompressed size of one block
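For readers who want to see where the block offsets come from without an extra package, the following is a minimal stdlib-only sketch that reads the stream footer and index, as laid out in the xz file format specification, and rebuilds the CompOffset / TotalSize / UncompOffset / UncompSize columns of xz -lvv. The function names are hypothetical, and it assumes a single stream with no stream padding and does not verify any CRC32.

```python
import io
import lzma
import struct

HEADER_SIZE = FOOTER_SIZE = 12


def _read_vli(buf, pos):
    """Decode one xz variable-length integer; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return value, pos
        shift += 7


def list_blocks(f):
    """Return (comp_offset, total_size, uncomp_offset, uncomp_size) per block."""
    f.seek(-FOOTER_SIZE, io.SEEK_END)
    footer = f.read(FOOTER_SIZE)
    if footer[-2:] != b"YZ":
        raise ValueError("no xz stream footer found")
    # Backward Size is stored as (index_size / 4) - 1, little-endian.
    index_size = (struct.unpack("<I", footer[4:8])[0] + 1) * 4
    f.seek(-FOOTER_SIZE - index_size, io.SEEK_END)
    index = f.read(index_size)
    if index[0] != 0:
        raise ValueError("index indicator byte missing")
    count, pos = _read_vli(index, 1)
    blocks = []
    comp_offset, uncomp_offset = HEADER_SIZE, 0
    for _ in range(count):
        unpadded, pos = _read_vli(index, pos)
        uncomp_size, pos = _read_vli(index, pos)
        total_size = (unpadded + 3) & ~3  # blocks are padded to 4 bytes
        blocks.append((comp_offset, total_size, uncomp_offset, uncomp_size))
        comp_offset += total_size
        uncomp_offset += uncomp_size
    return blocks


# Small self-contained demo on an in-memory, single-block stream.
sample = lzma.compress(b"x" * 100_000)
print(list_blocks(io.BytesIO(sample)))
```

With that list, finding the block that holds a given uncompressed offset is a binary search over the uncomp_offset values, after which only that block needs to be read and decoded.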
Upvotes: 0
Reputation: 71
It appears I just need to prepend the first 12 bytes of the file (the stream header) to the block I need, using cat, i.e.
cat <(head -c 12 2022-06-16T00:00:00.xz) <(head -c 2666360 2022-06-16T00:00:00.xz | tail -c 836008) | unxz -c
Upvotes: 0