ccpizza

Reputation: 31801

How to create a 'shallow' tar archive without the actual file data?

Is there a way to create a tar archive that will only contain file names but omit the actual file data?

The intent is to create a hierarchical 'mirror' of a drive that will only contain the directory structure and file names (preferably with sizes) but omit the actual file data.

The purpose is to generate an inventory of what is on a disk, i.e. something that would be better and faster than the output of ls -R -S -l / but possibly in a less verbose format.

I am aware that the question is about [mis-]using tar for something that it is not meant to be used for, but would like to investigate all options and push the limits of what is possible.

One option I'm experimenting with is to create a RAM-backed tmpfs filesystem (to avoid unnecessary disk writes and to increase speed), use lndir (from the xutils-dev package) to mirror the entire subtree with symlinks (lndir /media/usb1 /ramtmpfs), and then run tar -cf usb1-filelist.tar /ramtmpfs. One limitation I'm running into with this approach is RAM size, which is easily exceeded with large subtrees even when only creating symlinks. Is there a better/saner way, possibly something that tar can do on its own?

Upvotes: 1

Views: 155

Answers (1)

ccpizza

Reputation: 31801

Following the hint from @CharlesDuffy, here are two Python scripts that build such an archive: one using tarfile (for .tar.gz) and one using zipfile (for .zip). Each takes the folder to 'archive' as the 1st argument and the name of the resulting archive as the 2nd.

Filling with zeros is only needed in order to display the correct original file size. Omitting it will speed up the operation significantly since compressing zeros is extra overhead, especially when files are huge.
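If the zero fill is dropped, every tar member has to be stored with size 0, so the original size is lost. One possible workaround, sketched here as an assumption rather than an established convention, is to record the size in a per-member pax extended header. The keyword `inventory.size` and the helper names are hypothetical; only `TarInfo.pax_headers` and `tarfile.PAX_FORMAT` come from the standard library.

```python
import tarfile

SIZE_KEY = "inventory.size"  # hypothetical keyword, not a standard pax field


def add_name_only(tar: tarfile.TarFile, name: str, size: int) -> None:
    """Add a zero-length member and record the real size in a pax header."""
    info = tarfile.TarInfo(name=name)
    info.size = 0  # no data blocks follow the header
    info.pax_headers = {SIZE_KEY: str(size)}  # pax values must be strings
    tar.addfile(info)


def read_sizes(archive: str) -> dict[str, int]:
    """Map member name -> recorded original size, read from the pax headers."""
    with tarfile.open(archive) as tar:
        return {
            m.name: int(m.pax_headers[SIZE_KEY])
            for m in tar
            if SIZE_KEY in m.pax_headers
        }
```

The archive has to be opened for writing with `format=tarfile.PAX_FORMAT`, e.g. `tarfile.open(archive, "w:gz", format=tarfile.PAX_FORMAT)`. Ordinary tools such as `tar -tv` will still show size 0 for these members; the recorded size is only visible to code that reads the extended header.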

Create TAR file with fake zero-filled files

#!/usr/bin/env python3
import io
import pathlib
import sys
import tarfile


def create_tar(folder: str, archive: str):
    # compresslevel=1 - fastest, bigger file
    # compresslevel=9 - slowest, smallest file
    with tarfile.open(archive, mode="w:gz", compresslevel=1) as tar:
        for path in pathlib.Path(folder).glob('**/*'):
            if path.is_file():
                # skip macOS metadata files before doing any further work
                if path.name.startswith(('.DS_Store', '._')):
                    print(f'Skipping {path.absolute()}')
                    continue

                size: int = path.stat().st_size
                print(f'adding {path.absolute()}...', end='')

                tar_info: tarfile.TarInfo = tarfile.TarInfo(name=str(path))
                tar_info.size = size
                # note: the zero-filled placeholder is built entirely in memory,
                # so a huge file temporarily needs `size` bytes of RAM
                memfile = io.BytesIO(b'\x00' * size)
                tar.addfile(tarinfo=tar_info, fileobj=memfile)

                print('ok')


if __name__ == '__main__':
    folder: str = sys.argv[1]
    archive_name: str = sys.argv[2]

    create_tar(folder, archive_name)
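The script above allocates a full-size buffer of zeros for every file, which can exhaust RAM on very large files. A minimal sketch of a lower-memory variant follows: since tarfile only calls `read()` on the object passed as `fileobj`, a small stream that produces zeros on demand is enough. `ZeroReader` and `add_placeholder` are hypothetical names introduced for illustration.

```python
import os
import tarfile


class ZeroReader:
    """Minimal read-only stream yielding `size` zero bytes, chunk by chunk."""

    def __init__(self, size: int):
        self._remaining = size

    def read(self, n: int = -1) -> bytes:
        # Serve at most what is left; a negative n means "everything left".
        if n is None or n < 0 or n > self._remaining:
            n = self._remaining
        self._remaining -= n
        return bytes(n)  # bytes(n) is n zero bytes


def add_placeholder(tar: tarfile.TarFile, path: str) -> None:
    """Add `path` to `tar` with its real size and mtime but zero-filled content."""
    st = os.stat(path)
    info = tarfile.TarInfo(name=path)
    info.size = st.st_size
    info.mtime = int(st.st_mtime)
    tar.addfile(info, fileobj=ZeroReader(info.size))
```

Inside `create_tar`, the four `memfile` lines could then be replaced by a single `add_placeholder(tar, str(path))` call. The compressor still has to process every zero byte, so this saves memory, not time.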

Create ZIP file with fake zero-filled files

#!/usr/bin/env python3
import sys
import zipfile
from datetime import datetime
from pathlib import Path
from zipfile import ZipFile
from zipfile import ZipInfo


def create_zip(folder: str, archive_name: str):
    with ZipFile(file=archive_name, mode='w', compression=zipfile.ZIP_DEFLATED) as zipper:
        for path in Path(folder).glob('**/*'):
            if path.is_file():
                # skip macOS metadata files before doing any further work
                if path.name.startswith(('.DS_Store', '._')):
                    print(f'Skipping {path.absolute()}')
                    continue

                print(f'adding {path.absolute()}...', end='')

                stat = path.stat()  # stat once, reuse for size and mtime
                size: int = stat.st_size
                modified = datetime.fromtimestamp(stat.st_mtime)
                dt = (modified.year, modified.month, modified.day, modified.hour, modified.minute, modified.second)
                zip_info: zipfile.ZipInfo = ZipInfo(filename=str(path), date_time=dt)
                zip_info.file_size = size  # if you set data=b'' + zipfile.ZIP_STORED then this *must* be zero or unset

                # create 0-length files
                # zipper.writestr(zip_info, data=b'', compress_type=zipfile.ZIP_STORED)

                # create dummy files filled with zero that compress well (for original file size display)
                zipper.writestr(zip_info, data=b'\x00' * size, compress_type=zipfile.ZIP_DEFLATED)
                print('ok')


if __name__ == '__main__':

    folder: str = sys.argv[1]
    archive_name: str = sys.argv[2]

    create_zip(folder, archive_name)  # returns None, so there is nothing to assign
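Once a placeholder archive exists, the inventory can be read back without extracting anything, because both formats store the name and size in their metadata. The following is a small sketch of such a reader; `list_archive` is a hypothetical helper name, and it assumes the archive was produced by one of the two scripts above.

```python
import tarfile
import zipfile


def list_archive(archive: str) -> list[tuple[str, int]]:
    """Return (name, size) pairs for every regular file recorded in the archive."""
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            return [(i.filename, i.file_size) for i in zf.infolist() if not i.is_dir()]
    # "r:*" lets tarfile detect the compression (gz, bz2, xz or none) itself
    with tarfile.open(archive, mode="r:*") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]
```

For a compressed tar this still has to decompress the whole stream to walk the headers, whereas a zip keeps a central directory at the end of the file, so listing a large zip is usually the faster of the two.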

Alternatives

Given that the original purpose of the tar was to generate an 'inventory' of a subtree for archival purposes that won't include the actual data, tar is probably not the best tool for this.

More suitable tools for this are GNU find or tree. For example, to create a JSON file with a directory listing of /media/usb0 that includes modification dates and file sizes:

tree -J --dirsfirst --charset utf-8 --ignore-case \
    --timefmt '%d-%b-%Y %H:%M' -s -D \
    -o usb0-index.json /media/usb0

An alternative approach uses find to create a TSV file. This requires GNU find, not BSD/macOS find (on macOS, use gfind after installing it with brew install findutils):

find . -not -path "*.Spotlight-V100*" \
  -not -path "*.DS_Store*" \
  -not -path "*.Trash*" \
  -not -path "*node_modules*" \
  -printf "%P\t%s\t%TY-%Tm-%Td %TH:%TM\n" > files.tsv

Now the tab-separated file files.tsv can be imported into e.g. an sqlite database:

sqlite3 files.db ".mode tabs" ".import files.tsv mytable"
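Where GNU find is not available, roughly the same inventory can be written straight into SQLite from Python. This is a sketch under stated assumptions: the function name `build_inventory`, the table name `files` and its three columns are made up for illustration, and unlike the find command above it records regular files only and applies no path exclusions.

```python
import os
import sqlite3
from datetime import datetime


def build_inventory(root: str, db_path: str) -> int:
    """Store (relative path, size, mtime) for each file under root; return the row count."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)  # lstat: do not follow symlinks
            except OSError:
                continue  # unreadable or vanished entry, skip it
            mtime = datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M")
            rows.append((os.path.relpath(full, root), st.st_size, mtime))

    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS files (path TEXT, size INTEGER, mtime TEXT)"
            )
            con.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
    finally:
        con.close()
    return len(rows)
```

A call such as `build_inventory("/media/usb0", "files.db")` yields a database that can be queried with, for instance, `SELECT path, size FROM files ORDER BY size DESC LIMIT 20;`.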

Upvotes: 4
