Zenith_Raven

Reputation: 115

How to convert a pandas DataFrame to an Arrow dataset?

The Hugging Face datasets library uses a particular dataset format called an Arrow dataset:

https://arrow.apache.org/docs/python/dataset.html

https://huggingface.co/datasets/wiki_lingua

I need to convert a regular pandas DataFrame to a dataset, or read a tabular CSV file as a dataset.

Is that possible?

Upvotes: 9

Views: 10232

Answers (1)

TDrabas

Reputation: 878

You can create a pyarrow.Table and then convert it to a Dataset. Here's an example.

import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

# Convert to an Arrow Table, then wrap it in an in-memory pyarrow dataset
table = pa.Table.from_pandas(df)
dataset = ds.dataset(table)

# Convert to a Hugging Face dataset
hg_dataset = Dataset(table)

To convert to a Table only, use the pa.Table.from_pandas(…) method, as shown in the docs and in the example above: https://arrow.apache.org/docs/python/pandas.html

A reference to the Hugging Face docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset

Upvotes: 12
