Mittenchops

Reputation: 19704

Chunked tokenization in huggingface has an arrow error

I'm following the code from this video at 1m25s, which shows:

def tokenize_and_chunk(texts):
  return tokenizer(
    texts["text"], truncation=True, max_length=context_length,
    return_overflowing_tokens=True
  )

tokenized_datasets = raw_datasets.map(
  tokenize_and_chunk, batched=True, remove_columns=["text"]
)

Here's my code, which produces an error when I run it:

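import pandas as pd
from datasets import Dataset
from transformers import AutoModel, AutoTokenizer

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

context_length = 1000

def tokenize_and_chunk(texts):
    # Split each text into chunks of at most context_length tokens
    return tokenizer(
        texts["text"], truncation=True, max_length=context_length,
        return_overflowing_tokens=True,
    )

# A single row whose text is long enough to overflow into several chunks
dataset = Dataset.from_pandas(pd.DataFrame([{"id": "123", "text": "Here are many words! "*5000}]))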

Printing the dataset shows that it looks fine:

Dataset({
    features: ['id', 'text'],
    num_rows: 1
})

OK, let's run the tokenizer:

toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

 0%
0/1 [00:00<?, ?ba/s]

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-69-d1216744e2ab> in <module>
----> 1 toknized_datasets = dataset.map(tokenize_and_chunk, batched=True, remove_columns=["text"])

ArrowInvalid: Column 1 named id expected length 5 but got length 1

Upvotes: 1

Views: 1424

Answers (1)

Domarm

Reputation: 2550

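Your dataset contains an "id" column in addition to the "text" column. With return_overflowing_tokens=True, the tokenizer splits your single row into 5 chunks, so the new columns have length 5 while the untouched "id" column still has length 1. Arrow cannot build a table from columns of different lengths, hence the error.
Remove the "id" column before calling map and it works.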

Also, if you look at the video's example using the "imdb" dataset (at the 48-second mark), the presenter removes the "label" column first, leaving the dataset with only the "text" column.

EDIT
To make your dataset work without removing "id" beforehand, change remove_columns=["text"] to remove_columns=["id", "text"] in the call to map.
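
If the id needs to survive the chunking rather than be dropped, one option is to repeat it once per chunk. Below is a minimal sketch, assuming a fast tokenizer (which returns an "overflow_to_sample_mapping" entry when return_overflowing_tokens=True). The function name tokenize_and_chunk_keep_id and its extra tokenizer and context_length parameters are illustrative and not taken from the video.

```python
def tokenize_and_chunk_keep_id(batch, tokenizer, context_length):
    """Tokenize a batch into chunks and repeat each row's "id" once per chunk."""
    out = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
    )
    # Assumption: a fast tokenizer, so entry i of this list is the index of
    # the input row that chunk i was cut from.
    sample_map = out.pop("overflow_to_sample_mapping")
    # Repeat the id so the "id" column has exactly one entry per chunk.
    out["id"] = [batch["id"][i] for i in sample_map]
    return out


# Possible usage with the objects from the question (not run here):
# tokenized = dataset.map(
#     tokenize_and_chunk_keep_id,
#     batched=True,
#     remove_columns=["text"],
#     fn_kwargs={"tokenizer": tokenizer, "context_length": context_length},
# )
```

Because the returned "id" column now has the same length as the tokenized columns, every column in the batch should line up and the length mismatch should not occur.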

Upvotes: 4
