Karl

Reputation: 5822

Update pandas dataframe based on slice

I have a DataFrame that I wish to split into "train" and "test" datasets using the sklearn.model_selection.train_test_split function. This function returns two slices of the original DataFrame. However, I need the result in a single DataFrame, with a column that identifies the entry type. I could write a function that does this instead, but using the sklearn function is convenient and reliable.

My current approach is as follows:

import pandas as pd
import numpy as np
from sklearn import model_selection

dates = pd.date_range('20130101',periods=10)
df = pd.DataFrame(np.random.randn(10,4),index=dates,columns=list('ABCD')).reset_index()

split = [0.8, 0.2]
split_seed = 123

train_df, test_df = model_selection.train_test_split(df, train_size = split[0], test_size = split[1], random_state=split_seed)

train_df["Dataset"] = "train"
test_df["Dataset"] = "test"

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
final_df = pd.concat([train_df, test_df])

This works perfectly, but results in a warning since I am modifying copied slices instead of the original df object:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

It doesn't really matter, since the original DataFrame is no longer used after this, but I'm curious how I could do this differently. I presume that instead of editing train_df and test_df and then appending them again, I could just edit df directly, but as I am not very familiar with how .loc and .iloc work, I'm struggling to see how this would be done.

Pseudocode that illustrates what I am looking for would be as follows:

df["Dataset"] = "train" WHERE index in train_df.index.values
df["Dataset"] = "test" WHERE index in test_df.index.values

Upvotes: 1

Views: 1932

Answers (2)

jpp

Reputation: 164683

One way is to use np.where to add a series conditional on a Boolean condition:

df['Dataset'] = np.where(df.index.isin(train_df.index.values), 'train', 'test')

This assumes, of course, that every index not contained in train_df exists in test_df.
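For reference, here is a minimal, self-contained sketch of the np.where approach applied to the frame from the question. The split sizes and seed are the question's own; the final print is only there to inspect the result.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Same toy frame as in the question: 10 rows, default integer index after reset_index().
dates = pd.date_range("20130101", periods=10)
df = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=list("ABCD")).reset_index()

train_df, test_df = model_selection.train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=123
)

# Label the rows of the original frame; any row not in train_df is assumed to be in test_df.
df["Dataset"] = np.where(df.index.isin(train_df.index), "train", "test")

print(df["Dataset"].value_counts())
```

With 10 rows and a 0.8/0.2 split, this labels 8 rows as train and 2 as test, and df keeps its original row order.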

Or use np.select for a more adaptable solution:

conds = [df.index.isin(train_df.index.values),
         df.index.isin(test_df.index.values)]

df['Dataset'] = np.select(conds, ['train', 'test'], 'other')
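To see where the 'other' fallback comes into play, here is a small sketch in which two rows belong to neither set. The index labels train_index and test_index are hypothetical stand-ins for train_df.index and test_df.index, chosen by hand so the result is easy to read.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10)})

# Hypothetical index labels standing in for train_df.index and test_df.index;
# rows 8 and 9 deliberately belong to neither set.
train_index = pd.Index([0, 1, 2, 3, 4, 5])
test_index = pd.Index([6, 7])

# Conditions are checked in order; rows matching none of them get the default.
conds = [df.index.isin(train_index), df.index.isin(test_index)]
df["Dataset"] = np.select(conds, ["train", "test"], default="other")

print(df["Dataset"].tolist())
```

Rows 0 to 5 come out as train, rows 6 and 7 as test, and rows 8 and 9 as other. The same pattern extends to further categories by adding one condition and one label per category.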

Upvotes: 3

user3471881

Reputation: 2724

If you don't want to work on the copies returned by the model_selection.train_test_split() call, you can write to df directly with loc:

df.loc[train_df.index, 'Dataset'] = 'train'
df.loc[test_df.index, 'Dataset'] = 'test'
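Put together with the setup from the question, the loc approach might look like the sketch below. It assumes the index labels of df are unique (true here, because of reset_index()), since loc selects every row that matches a label.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

dates = pd.date_range("20130101", periods=10)
df = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=list("ABCD")).reset_index()

train_df, test_df = model_selection.train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=123
)

# The first assignment creates the column (rows not yet labelled hold a missing value);
# the second assignment fills in the remaining rows.
df.loc[train_df.index, "Dataset"] = "train"
df.loc[test_df.index, "Dataset"] = "test"

# Rows keep their original order, unlike the result of concatenating train_df and test_df.
print(df[["index", "Dataset"]])
```

Because only df is written to, train_df and test_df are never modified, so the copy-of-a-slice warning from the question should not be triggered.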

Upvotes: 3
