Reputation: 83387
I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:
pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --local-dir-use-symlinks False
However, the downloaded files don't have their original filenames. Instead, their hashes (git-sha or sha256, depending on whether they’re LFS files) are used as filenames:
--- /home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/blobs ---------------------------------------------
/..
12.9 GiB [##########] b581945ddee5e673fa2059afb25274b1523f270687b5253cb8aa72865760ebc0
3.9 GiB [### ] 86ebd2861a42b27168d75f346dd72f0e2b9eaee0afb90890beff15d025af45c6
3.9 GiB [## ] f9b81739ee30450b930390e1155e2cdea1b3063379ba6fd9253513eba1ab1e05
3.7 GiB [## ] e54c7d123ad93f4144eebdca2827ef81ea1ac282ddd2243386528cd157c02f36
3.7 GiB [## ] 736e225a7dd38a7987d0745b1b2f545ab701cfdf1f639874f5743b5bfb5cb1e1
3.7 GiB [## ] 0687246c92ec87b54e1c5fe623a77b650c02e6884e17a6f0fb4052a862d928d0
3.6 GiB [## ] 2becb5f9878b95f1b12622f50868f5855221985f05910d7cc759e6be074e6b8e
3.5 GiB [## ] 2208068c69b39c46ee9fac862da3c060c58b61adcaee1b3e6aa5d6d5dd3eba86
3.5 GiB [## ] caf87e71232cbb8a31960a26ba30b9412c15893c831ef118196c581cfd3a3779
3.4 GiB [## ] dc88cbf0ef45351bdc1f53c4396466d3e79874803719e266630ed6c3ad911d6a
3.4 GiB [## ] f05f7fb3b55b6840ebc4ada5daa28742bbae6ad4dcc35781dc811024f27a1b4e
3.4 GiB [## ] 88bd831618b36330ef5cd84b7ccbc4d5f3f55955c0b223208bc2244b27fb2d78
3.4 GiB [## ] bf80943b3389ddbeb8fb8a56af2d7fa5d09c5af076aac93f54ad921ee382c77d
3.3 GiB [## ] 83b2627e644c9ad0486e3bd966b02f014722e668d26b9d52394c974fcf2fdcf8
3.2 GiB [## ] e52e7b086dabd431b25cf309e1fe513190543e058f4e7a2d8e05b22821ded4fe
3.2 GiB [## ] 4fe583348f3ac118f34c7b93b6a187ba4e21a5a7f5b6ca1a6adbce1cc6d563a9
3.2 GiB [## ] ae6b6faca3bbd75e7ca99ccf20b55b017393bf09022efb8459293afffe06dc6e
3.1 GiB [## ] 5865379a894f8dc40703bdc1093d45fda67d5e1a742a2eebddd37e1a00f067fd
3.1 GiB [## ] cd346324b29390a589926ccab7187ae818cf5f9fcbaf8ecc95313e6cdfab86bc
3.0 GiB [## ] 914eb2b1174a662e3faebac82f6b5591a54def39a9d3a7e5ab2347ecc87a982f
2.9 GiB [## ] 24789f33332e8539b2ee72a0a489c0f4d0c6103f7f9600de660d78543ade9111
2.9 GiB [## ] 35e8da5f831b36416c9569014c58f881a0a30c00db9f3caae0d7db6a8fd3c694
2.8 GiB [## ] d5127e0298661d40a343d58759ed6298f9d2ef02d5c4f6a30bd9e07bc5423317
2.8 GiB [## ] 1b4e1951da2462ca77d94d220a58c97f64caa2b2defe4df95feed9defcee6ca7
2.8 GiB [## ] 75a4725625c095d98ecef7d68d384d7b1201ace046ef02ed499776b0ac02b61e
2.8 GiB [## ] fefbbc3e87be522b7e571c78a188aba35bd5d282cf8f41257097a621af64ff60
Total disk usage: 184.8 GiB Apparent size: 184.8 GiB Items: 85
How can I download a HuggingFace dataset via HuggingFace CLI while keeping the original filenames?
Upvotes: 1
Views: 2547
Reputation: 77
cp -L follows symbolic links when copying files, so copying the snapshot directory gives you the real files under their original names. Since the snapshot is a directory, add -r: cp -rL ~/.cache/huggingface/hub/<YOUR_REPO>/snapshots/<HASH> <DESTINATION_PATH>. You can also use --reflink to make a copy-on-write copy and avoid the extra disk usage of a full copy, on filesystems that support it (such as Btrfs or XFS).
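For example, with the dataset from the question and the default cache location, the command might look like the sketch below; the snapshot hash and the destination directory ~/MeetingBank_Audio are placeholders to adapt to your own setup, and --reflink=auto simply falls back to a regular copy when the filesystem does not support reflinks:
cp -rL --reflink=auto \
    ~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots/<HASH> \
    ~/MeetingBank_Audio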
Upvotes: 0
Reputation: 76
I ran into the same problem and wrote a Python script to handle it.
For example, I downloaded the naver-clova-ix/synthdog-en dataset with:
$ huggingface-cli download --repo-type dataset --resume-download naver-clova-ix/synthdog-en --local-dir synthdog-en
The synthdog-en directory structure is as follows:
synthdog-en
├── README.md
├── data
│ ├── train-00000-of-00084-26dbc51f3d0903b9.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/9d0260e08cb5a4f9c14fa794465bcb66fae6ef7ccc2f6d7ef20efa44810c0648
│ ├── train-00001-of-00084-3efa94914043c815.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/04441e203ff713743c0c9a1009f71f97e47bc4d7b2c9313f4fcfa9c3e73b20e3
│ ├── ...
│ └── validation-00000-of-00001-394e0bd4c5ebec42.parquet -> ../../../.cache/huggingface/hub/datasets--naver-clova-ix--synthdog-en/blobs/4e5f27b7a976041855d80eb07680de4ea014be07a494f40b246058dfce46d44b
└── dataset_infos.json
The full Python script is as follows:
import shutil
from pathlib import Path

from tqdm import tqdm


def cp_symlink_file_to_dst(file_path: Path, dst_dir: Path):
    if not file_path.is_symlink():
        return
    # The link target is relative (e.g. ../../../.cache/huggingface/hub/...);
    # rebuild it as an absolute path under the home directory, which assumes
    # the default cache location. Path.readlink() requires Python 3.9+.
    real_file_path = file_path.readlink()
    real_file_path = Path.home() / str(real_file_path).rpartition("../")[-1]
    # Keep the symlink's own (human-readable) name for the copy.
    real_file_name = file_path.name
    dst_file_path = Path(dst_dir) / real_file_name
    shutil.copy(real_file_path, dst_file_path)


if __name__ == "__main__":
    data_dir = Path("data")
    data_paths = list(data_dir.glob("*.parquet"))
    dst_dir = Path("output")
    dst_dir.mkdir(parents=True, exist_ok=True)
    for file_path in tqdm(data_paths):
        cp_symlink_file_to_dst(file_path, dst_dir)
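To use it, run the script from the directory that contains the data folder (i.e. the --local-dir passed to huggingface-cli download); the dereferenced copies end up in output. Note that rebuilding the target path relative to the home directory assumes the default cache location; calling file_path.resolve() instead would return the blob path regardless of where the cache lives.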
The output directory is as follows:
output
├── train-00000-of-00084-26dbc51f3d0903b9.parquet
├── train-00001-of-00084-3efa94914043c815.parquet
├── ...
├── train-00083-of-00084-5e6bb79e23f90f3b.parquet
└── validation-00000-of-00001-394e0bd4c5ebec42.parquet
Upvotes: 1
Reputation: 83387
One has to look at the snapshots folder:
/home/username/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots
It contains the original, readable filenames. However, the files are symlinks pointing to the blob files, whose names are hashes. Replacing these symbolic links with the actual files stored in the blobs folder gives you the original files under their original filenames.
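For instance, listing one of the snapshot subdirectories shows each readable filename as a symlink into blobs (the Alameda/mp3 folder of this dataset is used here only as an illustration):
ls -l ~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots/*/Alameda/mp3/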
To replace the symbolic links with the actual files on Linux, one can use u1686_grawity's script:
script.sh
#!/bin/sh
set -e
for link; do
    test -h "$link" || continue

    dir=$(dirname "$link")
    reltarget=$(readlink "$link")
    case $reltarget in
        /*) abstarget=$reltarget;;
        *)  abstarget=$dir/$reltarget;;
    esac

    rm -fv "$link"
    cp -afv "$abstarget" "$link" || {
        # on failure, restore the symlink
        rm -rfv "$link"
        ln -sfv "$reltarget" "$link"
    }
done
To run:
find . -type l -exec /path/to/script.sh {} +
result:
(base) dernoncourt@server:~/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots$ tree --du -h
.
└── [255G] 27779a666ff5fd879f4c5567489ff47e82364abd
├── [ 29G] Alameda
│ ├── [499M] Alameda-transcripts-videolist.zip
│ └── [ 29G] mp3
│ ├── [3.7G] alameda-1.zip
│ ├── [3.5G] alameda-2.zip
│ ├── [3.9G] alameda-3.zip
│ ├── [3.7G] alameda-4.zip
│ ├── [3.3G] alameda-5.zip
│ ├── [3.4G] alameda-6.zip
│ ├── [2.9G] alameda-7.zip
│ ├── [3.0G] alameda-8.zip
│ └── [1.2G] alameda-9.zip
├── [3.9G] Boston
│ ├── [3.2K] boston_video_list.txt
│ ├── [3.9G] mp3
│ │ └── [3.9G] Boston.zip
│ └── [8.1M] transcripts
│ └── [8.1M] transcripts.zip
├── [ 54G] Denver
│ ├── [650M] Denver-transcripts-videolist.zip
│ └── [ 53G] mp3
│ ├── [2.0G] Denver-1.zip
│ ├── [3.2G] Denver-10.zip
│ ├── [2.5G] Denver-11.zip
│ ├── [2.7G] Denver-12.zip
│ ├── [1.5G] Denver-13.zip
│ ├── [2.5G] Denver-14.zip
[...]
Upvotes: 0