MadDanWithABox

Reputation: 103

How do I successfully set and retrieve metadata information for a HuggingfaceDataset on the Huggingface Hub?

I have a number of datasets, which I create from a dictionary like so:

from datasets import Dataset, DatasetDict, DatasetInfo

info = DatasetInfo(
    description="my happy lil dataset",
    version="0.0.1",
    homepage="https://www.myhomepage.co.uk",
)
train_dataset = Dataset.from_dict(prepare_data(data["train"]), info=info)
test_dataset = Dataset.from_dict(prepare_data(data["test"]), info=info)
validation_dataset = Dataset.from_dict(prepare_data(data["validation"]), info=info)

I then combine these into a DatasetDict.

# Create a DatasetDict
dataset = DatasetDict(
    {"train": train_dataset, "test": test_dataset, "validation": validation_dataset}
)

So far, so good. If I access dataset['train'].info.description, I see the expected result of "my happy lil dataset".

So I push to the hub, like so:

dataset.push_to_hub(f"{organization}/{repo_name}", commit_message="Some commit message")

And this succeeds too.

However, when I pull the dataset back down from the hub and access its info, I get an empty string instead of my dataset's description:

pulled_data = load_dataset(f"{organization}/{repo_name}", use_auth_token=True)

# I expect the following to print out "my happy lil dataset"
print(pulled_data["train"].info.description)
# However, instead it returns ''

Am I loading my data in from the hub incorrectly? Am I pushing only my dataset and not the info somehow? I feel like I’m missing something obvious, but I’m really not sure.

Upvotes: 0

Views: 64

Answers (1)

M Aung

Reputation: 34

This might be caused by how the dataset's version is cached. Without an explicit version attribute, the library's default versioning may not preserve all of the metadata, such as the description.

Try declaring VERSION in a dataset builder class, like this:

import datasets


class My_dataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self) -> datasets.DatasetInfo:
        return datasets.DatasetInfo(
            description="my happy lil dataset",
            # List the features provided by your dataset, with their types.
            features=datasets.Features(
                {
                    "f1": datasets.Value("string"),
                    "f2": datasets.Value("string"),
                }
            ),
            homepage="https://www.myhomepage.co.uk",
        )

    # A complete builder must also implement _split_generators and _generate_examples.

Upvotes: 0
