Reputation: 49
I am trying to build a predictor that tells me whether a tweet is about a natural disaster or not, using the Kaggle dataset.
My data looks like this:
text target
15 What's up man? 0
16 I love fruits 0
17 Summer is lovely 0
18 My car is so fast 0
The list goes on.
These are the value counts for the target column:
0 4342
1 3271
Name: target, dtype: int64
This is my DataBlock:
dls_lm = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=15, is_lm=True), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=ColSplitter())
These are my DataLoaders:
dls = dls_lm.dataloaders(df2, bs=24)
This is the error I am getting:
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'is_valid'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
5 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 'is_valid'
If anyone knows how I can fix this, it would really help me. Thanks!
Upvotes: 4
Views: 1032
Reputation: 515
The reason for this error is the parameter splitter=ColSplitter().
Replace it with something like splitter=RandomSplitter(valid_pct=0.1, seed=42)
The signature of ColSplitter is
def ColSplitter(col='is_valid'):
"Split `items` (supposed to be a dataframe) by value in `col`"
What does that mean? fastai splits your input data into a training set and a validation set so that it can assess the performance of your model during training, typically after each epoch.
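To make the idea of a train/validation split concrete, here is a minimal sketch in plain Python. It is not fastai's implementation; the function name random_split is hypothetical, and the row count 7613 is simply the sum of the two target counts from the question (4342 + 3271).

```python
import random

def random_split(n_items, valid_pct=0.1, seed=42):
    """Return (train_idxs, valid_idxs): shuffled row positions, split reproducibly.

    Hypothetical helper for illustration only, not part of fastai.
    """
    rng = random.Random(seed)          # private RNG so the seed gives a repeatable split
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)     # number of rows held out for validation
    return idxs[cut:], idxs[:cut]

# 7613 rows = 4342 (target 0) + 3271 (target 1) from the question
train_idxs, valid_idxs = random_split(7613)
print(len(train_idxs), len(valid_idxs))  # 6852 761
```

The model is fitted on the rows in train_idxs only, and the rows in valid_idxs are used to measure how well it generalises.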
ColSplitter expects your input DataFrame to have a column is_valid that specifies which items (rows) should be in the validation set.
Since you don't have a column called is_valid in your input data, you should replace ColSplitter with a different data splitting strategy, e.g. random splitting:
splitter=RandomSplitter(valid_pct=0.1, seed=42)
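If you would rather keep ColSplitter, the other option is to add the is_valid column yourself. The sketch below uses a list of dicts as a stand-in for the DataFrame so it runs without pandas or fastai. The helpers add_is_valid and col_split are hypothetical names that only mimic the behaviour described above; they are not fastai's code.

```python
import random

def add_is_valid(rows, valid_pct=0.1, seed=42):
    """Return copies of rows with a boolean 'is_valid' key (hypothetical helper)."""
    rng = random.Random(seed)
    n_valid = int(valid_pct * len(rows))
    valid_positions = set(rng.sample(range(len(rows)), n_valid))
    return [dict(row, is_valid=(i in valid_positions)) for i, row in enumerate(rows)]

def col_split(rows, col="is_valid"):
    """Split row positions by the value in col; raises KeyError if col is missing."""
    valid = [i for i, row in enumerate(rows) if row[col]]
    train = [i for i, row in enumerate(rows) if not row[col]]
    return train, valid

# Stand-in data with the same columns as the question: text and target
rows = [{"text": f"tweet {i}", "target": i % 2} for i in range(20)]

# Without the column, the split fails in the same way as in the traceback
try:
    col_split(rows)
except KeyError as err:
    missing = err.args[0]
    print("missing column:", missing)

# With the column added, the split works
flagged = add_is_valid(rows)
train, valid = col_split(flagged)
print(len(train), len(valid))
```

With a real DataFrame, the same idea would be to assign a boolean list or Series to df2['is_valid'] before calling dataloaders, and leave splitter=ColSplitter() unchanged.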
Upvotes: 4