Reputation: 283
I am trying to load a text file with the Hugging Face datasets library (https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html):
from datasets import load_dataset
dataset = load_dataset('text', data_files='my_file.txt')
This text file already contains a header line. How do I indicate this to the module (say, something like header = True in pandas read_csv())?
Also, how do I specify that it is tab- or comma-separated?
Is there a way to present this data in tabular format?
Upvotes: 0
Views: 1743
Reputation: 9678
You can read it with pandas and then convert it to a dataset:
from datasets import Dataset
import pandas as pd
df = pd.read_table(path)
ds = Dataset.from_pandas(df)
Upvotes: 0
Reputation: 572
This is an old question, but for newcomers:
To read a TSV file:
from datasets import load_dataset
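# Pass the file via data_files; the second positional argument of load_dataset is the config name, not the path.
dataset = load_dataset("csv", data_files='path/to/your/file.tsv', delimiter='\t')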
By default, it will infer the column names from the first line.
If your file doesn't have a header line and you want to specify the column names, use
column_names = ['col1', 'col2', 'col3']
dataset = load_dataset("csv", data_files='path/to/your/file.tsv', delimiter='\t', column_names=column_names)
See the docs: https://huggingface.co/docs/datasets/v2.12.0/en/package_reference/loading_methods#datasets.packaged_modules.csv.CsvConfig
Upvotes: 2
Reputation: 19510
The csv loader uses pandas.read_csv() under the hood, so you can pass read_csv parameters such as sep through load_dataset:
from datasets import load_dataset
a = load_dataset("csv", data_files="bla.tsv", sep="\t")
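Since the loader delegates to pandas.read_csv(), the header and separator behaviour can be checked with pandas alone. The following is only a minimal sketch under that assumption: the in-memory TSV content and the column names "id" and "text" are made up for illustration, and it does not call load_dataset itself.

```python
import io

import pandas as pd

# Hypothetical tab-separated content whose first line is a header.
tsv_with_header = "id\ttext\n1\thello\n2\tworld\n"

# With the default header="infer", the first line becomes the column names.
df = pd.read_csv(io.StringIO(tsv_with_header), sep="\t")
print(list(df.columns))

# The same rows without a header line: supply the names explicitly.
tsv_no_header = "1\thello\n2\tworld\n"
df_named = pd.read_csv(
    io.StringIO(tsv_no_header),
    sep="\t",
    header=None,            # tell pandas there is no header row
    names=["id", "text"],   # hypothetical column names
)
print(df_named)
```

Printing the DataFrame also gives the tabular view asked about in the question; a DataFrame like this can then be handed to Dataset.from_pandas as in the earlier answer.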
Upvotes: 0