Reputation: 1044
I have a function to split a dataset into training and test sets:
def train_test_split(df, train_percent=.7, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    train = df.iloc[perm[:train_end]]
    test = df.iloc[perm[train_end:]]
    return train, test
It works fine on a dataframe that is 249681 rows × 9 columns.
Of these 9 columns, I decided to drop 4 because of too many missing values:
df_subset_dup = df_encode.iloc[:,:5]
df_subset = df_subset_dup.drop_duplicates()
After that, when I do df_trainRaw4, df_testRaw4 = train_test_split(df_subset), I get IndexError: positional indexers are out-of-bounds. However, doing df_trainRaw4, df_testRaw4 = train_test_split(df_subset_dup) returns no errors.
What am I doing with drop_duplicates that's causing the error, and how do I rectify it?
Upvotes: 0
Views: 23
Reputation: 214967
perm is a permutation of the data frame's actual index labels, but you are using position-based iloc to subset the data frame with it. This becomes a problem after dropping duplicates: the duplicate rows are removed, but the remaining rows keep their original index labels, so the largest label can now be greater than or equal to the number of rows in the data frame, and iloc treats it as an out-of-bounds position. Changing iloc to loc should fix it:
train = df.loc[perm[:train_end]]
test = df.loc[perm[train_end:]]
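As an alternative to switching to loc, here is a minimal sketch that keeps iloc but shuffles row positions instead of index labels. It is not the asker's original code: the toy data frame and the use of np.random.RandomState are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def train_test_split(df, train_percent=0.7, seed=None):
    # Shuffle row positions (0..m-1) rather than index labels, so iloc
    # stays valid even when the index has gaps after drop_duplicates.
    rng = np.random.RandomState(seed)
    m = len(df)
    perm = rng.permutation(m)
    train_end = int(train_percent * m)
    train = df.iloc[perm[:train_end]]
    test = df.iloc[perm[train_end:]]
    return train, test


# Hypothetical toy data: rows 1 and 3 are duplicates of rows 0 and 2,
# so drop_duplicates leaves 4 rows with index labels 0, 2, 4, 5.
df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 4], "b": [0, 0, 1, 1, 2, 3]})
df_subset = df.drop_duplicates()

train, test = train_test_split(df_subset, seed=0)
```

Here the labels 4 and 5 would be out of bounds as positions in a 4-row frame, which is the situation that triggers the IndexError in the original function. Another option would be to call reset_index(drop=True) after drop_duplicates so that labels and positions coincide again.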
Upvotes: 1