Reputation: 39
I currently have a CSV with many rows (around 200k) and many columns. I basically want a time-series training and test data split. I have many unique items in my dataset, and I want the first 80% (chronologically) of each item's rows to be in the training data. I wrote the following code to do so:
import pandas as pd
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date', ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)
It seems like trainvals and testvals are being populated properly, but they are not being added to test and train. Am I adding them in wrong?
Upvotes: 2
Views: 1149
Reputation: 5183
You can avoid the loop with df.groupby and quantile:

train = df.groupby('itemid').quantile(0.8)
test = df.loc[~df.index.isin(train.index), :]  # all rows not in train

Note this could have unexpected behavior if df.index is not unique.
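A loop-free per-item chronological split can also be sketched with groupby transforms (the itemid/Date column names follow the question; the toy data here is made up for illustration):

```python
import pandas as pd

# Toy frame: two items, five dated rows each (hypothetical data)
df = pd.DataFrame({
    'itemid': ['a'] * 5 + ['b'] * 5,
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-04', '2021-01-05'] * 2),
    'value': range(10),
})

# Order rows chronologically within each item
df = df.sort_values(['itemid', 'Date'])

# Position of each row within its item group: 0, 1, 2, ...
rank = df.groupby('itemid').cumcount()

# Number of rows in each item group, aligned back to every row
size = df.groupby('itemid')['itemid'].transform('size')

# First 80% of each group goes to train, the rest to test
mask = rank < (size * 0.8).astype(int)
train = df[mask]
test = df[~mask]
# train has 4 rows per item, test has 1 row per item
```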
Upvotes: 1
Reputation: 107567
Your immediate issue is not re-assigning inside the for-loop:
train = train.append(trainvals)
test = test.append(testvals)
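To illustrate why the re-assignment matters (note that DataFrame.append was removed in pandas 2.0, so the same pattern is shown here with pd.concat, which behaves the same way):

```python
import pandas as pd

# Both DataFrame.append (removed in pandas 2.0) and pd.concat return a
# NEW frame rather than modifying anything in place, so the result must
# be assigned back.
train = pd.DataFrame({'x': [0]})
trainvals = pd.DataFrame({'x': [1, 2]})

pd.concat([train, trainvals])          # result discarded; train unchanged
assert len(train) == 1

train = pd.concat([train, trainvals])  # re-assigned; train now has 3 rows
assert len(train) == 3
```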
However, it is memory-inefficient to grow large objects like data frames in a loop. Instead, consider iterating across groupby to build a list of dictionaries containing the train and test splits via a list comprehension, then call pd.concat to bind each set together. Use a defined method to organize the processing.
def split_dfs(df):
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]
    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(g_df) for g, g_df in df.groupby('itemid')]

train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
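On a small toy frame (column names taken from the question, data made up), the approach above splits each item 80/20 chronologically:

```python
import pandas as pd

def split_dfs(df):
    # Sort chronologically, then cut at the 80% mark
    df = df.sort_values(by='Date')
    cut = int(len(df) * 0.8)
    return {'train': df[:cut], 'test': df[cut:]}

# Hypothetical data: two items with five dated rows each
toy = pd.DataFrame({
    'itemid': ['a'] * 5 + ['b'] * 5,
    'Date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03',
                            '2021-01-04', '2021-01-05'] * 2),
})

dfs = [split_dfs(g_df) for g, g_df in toy.groupby('itemid')]
train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
# 4 train rows and 1 test row (the latest date) per itemid
```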
Upvotes: 2