Reputation: 31801
Is there a way to create a tar archive that will only contain file names but omit the actual file data?
The intent is to create a hierarchical 'mirror' of a drive that will only contain the directory structure and file names (preferably with sizes) but omit the actual file data.
The purpose is to generate an inventory of what is on a disk, i.e. something that would be better and faster than the output of ls -R -S -l /
but possibly in a less verbose format.
I am aware that the question is about [mis-]using tar
for something that it is not meant to be used for, but would like to investigate all options and push the limits of what is possible.
One possible option I'm experimenting with is creating a RAM tmpfs
filesystem (in order to avoid writing to disk unnecessarily and to increase speed) and then using lndir
(from the xutils-dev
package) to mirror the entire subtree with symlinks: lndir /media/usb1 /ramtmpfs
followed by tar cf usb1-filelist.tar /ramtmpfs
. One limitation I'm running into with this approach is RAM size, which is easily exceeded by large subtrees even when creating only symlinks. Is there a better/saner way, possibly something that tar
can do on its own?
Upvotes: 1
Views: 155
Reputation: 31801
Following the hint from @CharlesDuffy, here are Python implementations with both tarfile
(for .tar.gz) and zipfile
(for .zip). Each script takes the folder to 'archive' as the 1st argument and the name of the resulting archive as the 2nd.
Filling with zeros is only needed in order to record (and later display) the correct original file size. Omitting it will speed up the operation significantly, since compressing zeros is extra overhead, especially when files are huge.
#!/usr/bin/env python3
import io
import pathlib
import sys
import tarfile


def create_tar(folder: str, archive: str):
    # compresslevel=1 - fastest, bigger file
    # compresslevel=9 - slowest, smallest file
    with tarfile.open(archive, mode="w:gz", compresslevel=1) as tar:
        for path in pathlib.Path(folder).glob('**/*'):
            if path.is_file():
                size: int = path.stat().st_size
                if path.name.startswith(('.DS_Store', '._')):
                    print(f'Skipping {path.absolute()}')
                    continue
                print(f'adding {path.absolute()}...', end='')
                tar_info: tarfile.TarInfo = tarfile.TarInfo(name=str(path))
                tar_info.size = size
                # preserve the original modification time (otherwise members are dated 1970)
                tar_info.mtime = int(path.stat().st_mtime)
                memfile = io.BytesIO()
                memfile.write(b'\x00' * size)
                memfile.seek(0)
                tar.addfile(tarinfo=tar_info, fileobj=memfile)
                print('ok')


if __name__ == '__main__':
    folder: str = sys.argv[1]
    archive_name: str = sys.argv[2]
    create_tar(folder, archive_name)
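As a quick sanity check (a minimal sketch, not part of the script above), an archive built this way can be read back with tarfile to confirm that the member header records the original size; the member name `dummy.bin` is just an example:

```python
import io
import tarfile

# Build a tiny size-only archive in memory: one member whose header
# claims 1024 bytes, backed by zero-filled (highly compressible) content.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="dummy.bin")
    info.size = 1024
    tar.addfile(info, io.BytesIO(b"\x00" * 1024))

# Reading it back shows the original size stored in the member header.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.getmember("dummy.bin")
    print(member.name, member.size)  # dummy.bin 1024
```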
#!/usr/bin/env python3
import sys
import zipfile
from datetime import datetime
from pathlib import Path
from zipfile import ZipFile
from zipfile import ZipInfo


def create_zip(folder: str, archive_name: str):
    with ZipFile(file=archive_name, mode='w', compression=zipfile.ZIP_DEFLATED) as zipper:
        for path in Path(folder).glob('**/*'):
            if path.is_file():
                size: int = path.stat().st_size
                if path.name.startswith(('.DS_Store', '._')):
                    print(f'Skipping {path.absolute()}')
                    continue
                print(f'adding {path.absolute()}...', end='')
                modified = datetime.fromtimestamp(path.stat().st_mtime)
                dt = (modified.year, modified.month, modified.day,
                      modified.hour, modified.minute, modified.second)
                zip_info: ZipInfo = ZipInfo(filename=str(path), date_time=dt)
                zip_info.file_size = size  # if you set data=b'' + zipfile.ZIP_STORED then this *must* be zero or unset
                # create 0-length files:
                # zipper.writestr(zip_info, data=b'', compress_type=zipfile.ZIP_STORED)
                # create dummy files filled with zeros that compress well (for original file size display):
                zipper.writestr(zip_info, data=b'\x00' * size, compress_type=zipfile.ZIP_DEFLATED)
                print('ok')


if __name__ == '__main__':
    folder: str = sys.argv[1]
    archive_name: str = sys.argv[2]
    create_zip(folder, archive_name)
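The same verification works for the zip variant (again a minimal sketch with a hypothetical member name): zipfile stores both the original file_size and the compress_size per member, which shows why zero-filled payloads stay cheap:

```python
import io
import zipfile

# One member filled with 100 000 zero bytes, deflated.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr(zipfile.ZipInfo("dummy.bin"),
               data=b"\x00" * 100_000,
               compress_type=zipfile.ZIP_DEFLATED)

# file_size keeps the original size; compress_size is tiny,
# since long runs of zeros deflate extremely well.
buf.seek(0)
with zipfile.ZipFile(buf) as z:
    info = z.getinfo("dummy.bin")
    print(info.file_size, info.compress_size)
```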
Given that the original purpose was to generate an 'inventory' of a subtree for archival purposes, without the actual data, tar
is probably not the best tool for the job.
Some more suitable tools for this are GNU find
or tree
; for example, to create a JSON file with a directory listing of /media/usb0
that includes modification dates and file sizes:
tree -J --dirsfirst --charset utf-8 --ignore-case \
--timefmt '%d-%b-%Y %H:%M' -s -D \
-o usb0-index.json /media/usb0
An alternative approach uses find
to create a TSV file (this implies GNU find
, not BSD/macOS find
; on macOS use gfind
after installing it with brew install findutils
):
find . -not -path "*.Spotlight-V100*" \
-not -path "*.DS_Store*" \
-not -path "*.Trash*" \
-not -path "*node_modules*" \
-printf "%P\t%s\t%TY-%Tm-%Td %TH:%TM\n" > files.tsv
Now the tab-separated file files.tsv
can be imported into, e.g., an SQLite database:
sqlite3 files.db ".mode tabs" ".import files.tsv mytable"
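If the sqlite3 CLI is not available, the same import can be sketched in pure Python with the stdlib sqlite3 module. The table name mytable matches the command above; the sample rows and column names (path, size, mtime) are hypothetical stand-ins for the path/size/date fields that the find -printf format produces:

```python
import sqlite3

# Hypothetical sample rows in the same path<TAB>size<TAB>date shape
# that the find command above writes to files.tsv.
rows = [
    ("docs/report.pdf", 204800, "2023-04-01 10:15"),
    ("music/track.mp3", 5242880, "2022-11-20 08:03"),
]

con = sqlite3.connect(":memory:")  # use "files.db" for a persistent database
con.execute("CREATE TABLE mytable (path TEXT, size INTEGER, mtime TEXT)")
con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)

# Example inventory query: largest files first.
for path, size in con.execute(
        "SELECT path, size FROM mytable ORDER BY size DESC"):
    print(path, size)
```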
Upvotes: 4