Reputation: 11
I am fine-tuning a model on my data for an NLP project using the Hugging Face library. Here is the code I am having trouble with. Has anyone been able to solve this problem?
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = testdata.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["labels"],
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True,
)
NB: I have seen suggestions about upgrading to the latest versions, and I have done that, but the problem persists.
Upvotes: 1
Views: 2590
Reputation: 155
In your case, testdata is of type DatasetDict, which holds your train split. testdata['train'], however, is of type Dataset, so testdata['train'].to_tf_dataset() will work as expected.
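A minimal sketch of that check, assuming testdata came from load_dataset without a split argument. The helper name select_split is hypothetical and not part of the datasets library; it relies on DatasetDict being a dict subclass.

```python
def select_split(data, split="train"):
    """Return one split if `data` is dict-like (e.g. a DatasetDict).

    Anything that is not a dict is assumed to already be a single
    Dataset and is returned unchanged.
    """
    if isinstance(data, dict):
        if split not in data:
            raise KeyError(f"split {split!r} not found; available: {list(data)}")
        return data[split]
    return data


# Usage with the objects from the question (assumes `testdata` and
# `data_collator` are already defined as shown there):
# train_data = select_split(testdata, "train")
# tf_dataset = train_data.to_tf_dataset(
#     columns=["input_ids", "token_type_ids", "attention_mask"],
#     label_cols=["labels"],
#     batch_size=2,
#     collate_fn=data_collator,
#     shuffle=True,
# )
```

Printing type(testdata) before calling to_tf_dataset is a quick way to confirm which of the two types you have.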
Upvotes: 0
Reputation: 55
I faced the same problem. In my case I was working with a csv file. I used the following code to load the dataset:
from datasets import load_dataset
dataset_training = load_dataset("csv", data_files=file)
Then calling to_tf_dataset raised:
AttributeError: 'DatasetDict' object has no attribute 'to_tf_dataset'
To work around this, I loaded the content as a pandas DataFrame and then created the dataset with a different method:
import pandas as pd
data = pd.read_csv("file.csv")
from datasets import Dataset
dataset = Dataset.from_pandas(data)
After that, the to_tf_dataset method worked correctly. I have no explanation for why, but it worked for me.
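A likely explanation, offered as a sketch rather than a confirmed diagnosis: load_dataset without a split argument returns a DatasetDict, while Dataset.from_pandas returns a single Dataset, which is the type that has to_tf_dataset. Assuming that is the cause, passing split="train" to load_dataset should give a Dataset directly, without the pandas round trip. The function name load_csv_as_dataset and the path file.csv are illustrative only.

```python
def load_csv_as_dataset(csv_path, split="train"):
    """Load a CSV file as a single Dataset instead of a DatasetDict."""
    # Imported inside the function so this sketch can be defined even
    # before the `datasets` package is installed.
    from datasets import load_dataset

    # With `split` given, load_dataset returns a Dataset;
    # without it, it returns a DatasetDict keyed by split name.
    return load_dataset("csv", data_files=csv_path, split=split)


# Usage (path is illustrative):
# dataset = load_csv_as_dataset("file.csv")
# tf_dataset = dataset.to_tf_dataset(...)  # same arguments as in the question
```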
Upvotes: 2