Reputation: 115
The Hugging Face datasets library uses a particular dataset format called an Arrow dataset:
https://arrow.apache.org/docs/python/dataset.html
https://huggingface.co/datasets/wiki_lingua
I need to convert a regular pandas DataFrame to such a dataset, or read a tabular CSV file as a dataset.
Is that possible?
Upvotes: 9
Views: 10232
Reputation: 878
You can create a pyarrow.Table and then convert it to a Dataset. Here's an example.
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
from datasets import Dataset

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
table = pa.Table.from_pandas(df)

# Convert to a pyarrow dataset
dataset = ds.dataset(table.to_batches())

# Convert to a Hugging Face dataset
hg_dataset = Dataset(table)
To convert to a Table only, you can use the from_pandas(…) method, as shown in the docs and the example above: https://arrow.apache.org/docs/python/pandas.html
A reference to Huggingface docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset
Upvotes: 12