Sachin

Reputation: 283

How to enable header in text files of load_dataset in huggingface?

I am trying to load a text file using huggingface (https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html)

from datasets import load_dataset
dataset = load_dataset('text', data_files='my_file.txt')

This text file already contains a header row. How do I indicate this to the module (say, something like header = True, as with pandas read_csv())?

Also, how do I specify that the file is tab- or comma-separated?

Is there a way to present this data in tabular format?

Upvotes: 0

Views: 1743

Answers (3)

Ahmad

Reputation: 9678

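You can read it with pandas and then convert it to a Dataset:

  from datasets import Dataset
  import pandas as pd

  # read_table defaults to tab-separated input and takes column names from the first line
  df = pd.read_table('my_file.txt')
  ds = Dataset.from_pandas(df)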
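
To make the header and separator handling explicit, here is a minimal sketch of the pandas step on its own. The in-memory sample and its column names are hypothetical, chosen only for illustration; header=0 and sep are standard pandas.read_csv arguments.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample whose first line is a header row.
sample = "id\ttext\n1\thello\n2\tworld\n"

# sep sets the delimiter; header=0 takes column names from the first line.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=0)

# Printing a DataFrame shows the data in tabular form.
print(df)
```

The resulting DataFrame can then be passed to Dataset.from_pandas(df) as in the snippet above.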

Upvotes: 0

Betty

Reputation: 572

This is an old question, but for newcomers:

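To read a TSV file, use the csv loader and pass the path as data_files (the second positional argument of load_dataset is the configuration name, not the file):

from datasets import load_dataset

dataset = load_dataset("csv", data_files='path/to/your/file.tsv', delimiter='\t')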

By default, it will infer the column names from the first line.

If your file doesn't have a header line and you want to specify the column names, use

column_names = ['col1', 'col2', 'col3']
dataset = load_dataset("csv", data_files='path/to/your/file.tsv', delimiter='\t', column_names=column_names)

See the docs: https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/loading_methods#datasets.packaged_modules.csv.CsvConfig

Upvotes: 2

cronoik

Reputation: 19510

The csv loader uses pandas.read_csv() under the hood, and you can pass read_csv parameters such as sep through load_dataset:

from datasets import load_dataset

a = load_dataset("csv", data_files="bla.tsv", sep="\t")

Upvotes: 0
