Juan Cruz Alric

Reputation: 49

fastai error while creating DataLoaders from a DataFrame

I am trying to build a predictor that tells me whether a tweet is talking about a natural disaster or not.

I am using the Kaggle dataset.

This is what I've got:

    text               target
15  What's up man?     0
16  I love fruits      0
17  Summer is lovely   0
18  My car is so fast  0

The list goes on.

For the target column, I get these value counts:

0    4342
1    3271
Name: target, dtype: int64

This is my DataBlock

dls_lm = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=15, is_lm=True), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=ColSplitter())

These are my DataLoaders

dls = dls_lm.dataloaders(df2, bs=24)

This is the error I'm getting

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'is_valid'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
5 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 'is_valid'

If anyone knows how I can fix this, it would really help me. Thanks!

Upvotes: 4

Views: 1032

Answers (1)

goerlitz

Reputation: 515

The reason for this error is the parameter splitter=ColSplitter().

TL;DR

Replace it with something like splitter=RandomSplitter(valid_pct=0.1, seed=42)

Detailed Answer

The signature of ColSplitter is

def ColSplitter(col='is_valid'):
    "Split `items` (supposed to be a dataframe) by value in `col`"

What does that mean? Well, fastai splits your input data into a training set and a validation set so that it can assess the performance of your model after every epoch of training.

By default, ColSplitter expects your input DataFrame to have a boolean column named is_valid that specifies which items (rows) should be in the validation set.
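If you would rather keep ColSplitter, one option is to add such a column yourself. The snippet below is a minimal sketch using only pandas and NumPy. It assumes a small toy DataFrame built from the rows shown in the question, and the 25% validation share and seed 42 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the question's df2 (only the four rows shown there).
df2 = pd.DataFrame({
    "text": ["What's up man?", "I love fruits",
             "Summer is lovely", "My car is so fast"],
    "target": [0, 0, 0, 0],
})

# Pick a reproducible random subset of row positions for validation.
# The 25% share and the seed are assumptions made for this sketch.
rng = np.random.default_rng(42)
n_valid = max(1, int(0.25 * len(df2)))
valid_pos = rng.choice(len(df2), size=n_valid, replace=False)

# Add the boolean column that ColSplitter looks up by default.
df2["is_valid"] = False
df2.loc[df2.index[valid_pos], "is_valid"] = True

print(df2)
```

With an is_valid column in place, the lookup that raised KeyError: 'is_valid' in the traceback should find the column it is asking for.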

Since you don't have a column called is_valid in your input data, you should replace ColSplitter with a different data splitting strategy, e.g. random splitting:

splitter=RandomSplitter(valid_pct=0.1, seed=42)
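To make the idea of a seeded random split concrete, here is a rough plain-Python sketch. It is not fastai's implementation: random_split is a hypothetical helper written for illustration, and it shuffles with Python's random module, so the exact indices will differ from what RandomSplitter produces. The item count 7613 is the sum of the two class counts in the question (4342 + 3271).

```python
import random

def random_split(n_items, valid_pct=0.1, seed=42):
    """Hypothetical helper: return (train_idx, valid_idx) as index lists."""
    idx = list(range(n_items))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(idx)
    cut = int(valid_pct * n_items)
    # The first `cut` shuffled indices go to validation, the rest to training.
    return idx[cut:], idx[:cut]

train_idx, valid_idx = random_split(7613, valid_pct=0.1, seed=42)
print(len(train_idx), len(valid_idx))
```

Fixing the seed means the same rows land in the validation set on every run, which makes results comparable between experiments.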

Upvotes: 4
