HuggingFace Dataset - pyarrow.lib.ArrowMemoryError: realloc of size failed

Question

I am trying to use Hugging Face Datasets for speech recognition with transformers, where I have pairs of text and audio. I can create a DataFrame from these two lists without any problem:

import pandas as pd
d = pd.DataFrame.from_dict({"audio": ts_audios, "sentence": ts_sent})

But when I try to wrap it in a Dataset (from Hugging Face datasets):

from datasets import Dataset
ds = Dataset.from_pandas(d)

it gives:

pyarrow.lib.ArrowMemoryError: realloc of size 4294967296 failed
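
For scale, the size in the message is exactly 2**32 bytes, i.e. 4 GiB. Below is a rough sanity check of what that means in audio terms, assuming float32 samples (as in the arrays here) and a 16 kHz sampling rate (an assumption for illustration only; `describe_alloc` is a made-up helper name, not part of any library):

```python
def describe_alloc(size_bytes, bytes_per_sample=4, sample_rate=16_000):
    """Translate a failed allocation size into GiB, sample count and hours of audio.

    bytes_per_sample=4 matches float32; sample_rate=16_000 is an assumed value.
    """
    samples = size_bytes // bytes_per_sample
    return {
        "gib": size_bytes / 2**30,
        "samples": samples,
        "hours": samples / sample_rate / 3600,
    }


info = describe_alloc(4294967296)  # size copied from the error message
print(info)
```

So a single buffer of that size would hold on the order of a billion float32 samples.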

The problem is caused by the audio list, which looks like this:

[array([ 1.3715802e-05,  1.3041631e-05, -1.5017368e-06, ...,
       -1.1172481e-01, -1.2214723e-01,  0.0000000e+00], dtype=float32), array([-0.06073862, -0.12271373, -0.11600843, ..., -0.11915235,
       -0.13458692,  0.        ], dtype=float32), array([-0.07074431, -0.12263235, -0.1065825 , ..., -0.10845864,
       -0.12171803,  0.        ], dtype=float32), array([-0.02499148, -0.04160473, -0.03867628, ..., -0.01881211,
       -0.02035856,  0.        ], dtype=float32), array([-0.18304674, -0.03917564, -0.030768  , ..., -0.11494933,
       -0.112398  , -0.12073436], dtype=float32) .....]

I must use the Dataset format if I want to use the transformers package from Hugging Face. Any idea how I can solve this issue?
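
One direction that might sidestep the allocation, sketched under assumptions rather than verified on this data: if the clips are still available as files on disk (the question does not say so), the "audio" column can hold file paths instead of decoded arrays and be cast to the `Audio` feature from datasets, which decodes a clip only when a row is accessed. The names `ts_paths`, `build_columns` and `to_dataset` are hypothetical, and the 16 kHz sampling rate is an assumed value.

```python
def build_columns(audio_paths, sentences):
    """Pair file paths with transcripts; no audio samples are loaded here."""
    if len(audio_paths) != len(sentences):
        raise ValueError("audio_paths and sentences must have the same length")
    return {"audio": list(audio_paths), "sentence": list(sentences)}


def to_dataset(columns, sampling_rate=16_000):
    """Build a Dataset whose "audio" column is decoded lazily from the paths.

    sampling_rate=16_000 is an assumed value; use whatever the model expects.
    """
    # Imported here so build_columns stays usable without the datasets package.
    from datasets import Audio, Dataset

    ds = Dataset.from_dict(columns)
    return ds.cast_column("audio", Audio(sampling_rate=sampling_rate))


# Hypothetical inputs: ts_paths would hold the file paths behind ts_audios.
ts_paths = ["clip_0.wav", "clip_1.wav"]
ts_sent = ["first transcript", "second transcript"]
columns = build_columns(ts_paths, ts_sent)
# ds = to_dataset(columns)  # requires the datasets package and real audio files
```

The point of this design is that the table then stores only short strings, so the size of the decoded audio no longer determines how much memory building the Dataset needs.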
