albero

Reputation: 199

Add new column to a HuggingFace dataset

My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it.

dataset = dataset.add_column('embeddings', embeddings)

The variable embeddings is a numpy memmap array of shape (5000000, 512).

But I get this error:

ArrowInvalid                              Traceback (most recent call last)
in
----> 1 dataset = dataset.add_column('embeddings', embeddings)

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    486         }
    487         # apply actual function
--> 488         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    489         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    490         # re-apply format to the output

/opt/conda/lib/python3.8/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    404         # Call actual function
    405 
--> 406         out = func(self, *args, **kwargs)
    407 
    408         # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in add_column(self, name, column, new_fingerprint)
   3346             :class:`Dataset`
   3347         """
-> 3348         column_table = InMemoryTable.from_pydict({name: column})
   3349         # Concatenate tables horizontally
   3350         table = ConcatenationTable.from_tables([self._data, column_table], axis=1)

/opt/conda/lib/python3.8/site-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
    367     @classmethod
    368     def from_pydict(cls, *args, **kwargs):
--> 369         return cls(pa.Table.from_pydict(*args, **kwargs))
    370 
    371     @inject_arrow_table_documentation(pa.Table.from_batches)

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/opt/conda/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

How can I solve this, ideally in a memory-efficient way, since the embeddings array does not fit in RAM?

Upvotes: 8

Views: 8126

Answers (5)

paradocslover

Reputation: 3294

In my case, I was facing this error while adding a column that had TF-IDF representations (2d ndarray) of the text sequences.

All I did was:

df_encoded['train'].add_column("tf_idf_repr", tf_idf_feats.tolist())

It needs a list of lists. This worked!
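
As a sketch of the same idea applied to a generic 2-d numpy array (the array and column name here are only illustrative, not from the original post):

import numpy as np
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["a", "b", "c"]})

# illustrative 2-d feature matrix with one row per dataset row
feats = np.random.rand(len(dataset), 4)

# add_column wants one Python value per row, so convert the 2-d array to a list of lists
dataset = dataset.add_column("tf_idf_repr", feats.tolist())

Keep in mind that .tolist() materializes the whole array as Python objects, so for an array as large as the one in the question this is very memory-hungry.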

Upvotes: 0

Maxim R.

Reputation: 11

For a dataset loaded from a Parquet file, this code worked:

data['train'].add_column(name="id", column=[i for i in range(len(data['train']))])

Upvotes: 1

Cyebukayire

Reputation: 947

This is how I solved the issue. It's sad that we are in 2023 and this problem still exists, but fortunately this workaround did the job for me.

Add a new column to a dataset

def add_new_column(df, col_name, col_values):
    # Define a function that assigns this row's value to the new column
    def create_column(example, idx):
        example[col_name] = col_values[idx]  # pick the value matching the row index
        return example

    # Apply the function to each row in the dataset (with_indices passes the row index)
    df = df.map(create_column, with_indices=True)

    return df

Then, you may call the function like this:

import datasets as ds

dataset = ds.Dataset.from_dict({"column_1": ["value1", "value2"]})
new_values = ["value3", "value4"]
updated_dataset = add_new_column(dataset, "column_2", [str(val) for val in new_values])

Note: please ensure that the length of the new column, i.e. len(new_values), is equal to the number of rows already in the dataset.
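
A quick sanity check before calling the function (a minimal sketch reusing the names from the example above):

assert len(new_values) == dataset.num_rows, "new column must have exactly one value per row"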

Upvotes: 1

0x26res

Reputation: 13942

The issue here is that you're trying to add a column, but the data you are passing is a 2-d numpy array. Arrow (the library used to represent datasets) only supports 1-d numpy arrays.
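
For illustration, this is roughly the call that fails at the pyarrow level (a minimal sketch, not taken from the original post):

import numpy as np
import pyarrow as pa

pa.array(np.zeros(3))       # fine: a 1-d numpy array becomes an Arrow array
pa.array(np.zeros((2, 3)))  # raises ArrowInvalid: only handle 1-dimensional arrays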

You can try to add each column of your 2d numpy array one by one:

# add each of the 512 embedding dimensions as its own 1-d column
for i, column in enumerate(embeddings.T):
    ds = ds.add_column('embeddings_' + str(i), column)

How can I solve this, ideally in a memory-efficient way, since the embeddings array does not fit in RAM?

I don't think there's a way around the memory issue. Hugging Face datasets are backed by an Arrow table, which has to fit in memory.

Upvotes: 1

Vyom Vyas

Reputation: 59

from datasets import load_dataset

ds = load_dataset("cosmos_qa", split="train")

# one value per row; the list length must match the number of rows
new_column = ["foo"] * len(ds)
ds = ds.add_column("new_column", new_column)

and you get a dataset:

Dataset({
    features: ['id', 'context', 'question', 'answer0', 'answer1', 'answer2', 'answer3', 'label', 'new_column'],
    num_rows: 25262
})

Upvotes: 5
