kurious

Reputation: 1044

Deduplicating dataframe causes issues with splitting dataframe

I have a function to split a dataset into training and test sets:

import numpy as np

def train_test_split(df, train_percent=.7, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    train = df.iloc[perm[:train_end]]
    test = df.iloc[perm[train_end:]]
    return train, test

It works fine on a dataframe that is 249681 rows × 9 columns

Of these 9 columns, I decided to keep only the first 5 (the other 4 had too many missing values), and then dropped duplicate rows:

df_subset_dup = df_encode.iloc[:,:5]
df_subset = df_subset_dup.drop_duplicates()

After that, when I do df_trainRaw4, df_testRaw4 = train_test_split(df_subset), I get IndexError: positional indexers are out-of-bounds. However, doing df_trainRaw4, df_testRaw4 = train_test_split(df_subset_dup) returns no errors.

What am I doing with drop_duplicates that's causing the error and how do I rectify it?

Upvotes: 0

Views: 23

Answers (1)

akuiper

Reputation: 214967

perm contains the data frame's actual index labels, but you are subsetting with iloc, which is position-based. After drop_duplicates removes some rows, the index is no longer contiguous, so the largest index label can exceed the number of remaining rows, and iloc raises the out-of-bounds error. Changing iloc to loc (label-based indexing) fixes it:

train = df.loc[perm[:train_end]]
test = df.loc[perm[train_end:]]
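A minimal sketch of the failure mode and the fix, using a small toy frame (column names `a` and `b` are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

def train_test_split(df, train_percent=.7, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)  # index *labels*, not positions
    train_end = int(train_percent * len(df.index))
    # loc selects by label, so gaps in the index are harmless
    train = df.loc[perm[:train_end]]
    test = df.loc[perm[train_end:]]
    return train, test

# After drop_duplicates the index has a gap (row 1 is removed),
# mirroring what happens to df_subset in the question.
df = pd.DataFrame({"a": [1, 1, 2, 3, 4], "b": [9, 9, 8, 7, 6]})
dedup = df.drop_duplicates()   # index is now [0, 2, 3, 4]

train, test = train_test_split(dedup, seed=0)
print(len(train), len(test))   # splits cleanly, no IndexError
```

Alternatively, calling `df_subset.reset_index(drop=True)` after `drop_duplicates` restores a contiguous 0..n-1 index, which would let the original iloc version keep working.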

Upvotes: 1

Related Questions