Reputation: 562
I'm following Jeff Heaton's Deep Learning course (course link) and working through the exercises as well. In the first programming assignment (question 5), the last task asks us to write the output of the KFold(5) cross-validated input data to a CSV file. After KFold(5) we have 5 sets of data (train/test). I need to glue them together and save them in one file. I tried, but my data gets overwritten and only the last fold ends up in my output.
I'm just starting out with Python, and I can't see how to write the loop so that the pandas append works correctly.
Link to the exercises: Exercise 5
Part of my code is below. df_cars is the original pandas DataFrame. df_cars1 is a copy of df_cars, but I make it empty so that I can append data to it later.
kf = KFold(5)
# df_cars.insert(0,'set', 'str')
df_cars.insert(1,'iteration', 0)
df_cars1 = pd.DataFrame(data=None, columns=df_cars.columns,index=df_cars.index)
df_cars1.dropna()
fold = 1
for train_index, validate_index in kf.split(df_cars):
    trainDF = pd.DataFrame(df_cars.ix[train_index])
    validateDF = pd.DataFrame(df_cars.ix[validate_index])
    trainDF[['set', 'iteration']] = 'T', fold
    validateDF[['set', 'iteration']] = 'V', fold
    print("Fold #{}, Training Size: {}, Validation Size: {}".format(fold,len(trainDF),len(validateDF)))
    fold+=1
    df_cars1 = pd.concat([validateDF,trainDF])
df_cars1.to_csv("./data/auto-mpg-kfold5.csv")
print(df_cars1)
Sample of my output is:
          mpg  iteration  set  cylinders  displacement  horsepower     weight  acceleration  year  origin  name
319  0.997344          5    V          4     -0.705077   -0.767632  -0.506545      0.701436    80       3  mazda 626
320  1.727537          5    V          4     -0.714680   -0.322309  -0.634239     -0.206262    80       3  datsun 510 hatchback
321  1.112638          5    V          4     -0.820308   -0.767632  -0.834055     -0.133646    80       3  toyota corolla
322  2.957335          5    V          4     -1.031565   -1.029586  -1.017318      0.846667    80       3  mazda glc
As you can see, all the values in the iteration column are 5, meaning that only the last (5th) fold ends up in df_cars1, and I need all five folds there.
Any help would be appreciated
Upvotes: 2
Views: 3352
Reputation: 36599
The problem is in the last line of your for loop:
...
df_cars1 = pd.concat([validateDF,trainDF])
This reassigns the df_cars1 variable to the current fold's training and validation data, so the data from the earlier iterations is lost.
Change it to:
df_cars1 = pd.concat([df_cars1, validateDF,trainDF])
That way it combines the previous folds with the current one. Hope it helps.
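For reference, here is a minimal sketch of the same idea that collects each fold's frames in a list and concatenates once after the loop. It is one possible variant, not the code from the course. The small df_cars built here is a hypothetical stand-in for the real data, and the names parts and df_folds and the output file name are made up for illustration. Note also that df_cars1 in the question is created with index=df_cars.index and the result of dropna() is discarded, so it starts out as a frame of NaN rows; concatenating onto it would most likely carry those rows into the CSV, which starting from an empty list avoids.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for df_cars: 10 rows with a default integer index.
df_cars = pd.DataFrame({
    "mpg": np.linspace(10.0, 37.0, 10),
    "cylinders": [4, 6, 8, 4, 4, 6, 8, 4, 6, 8],
})

kf = KFold(5)
parts = []  # labelled train/validation frames from every fold

for fold, (train_index, validate_index) in enumerate(kf.split(df_cars), start=1):
    # kf.split yields positional indices, so .iloc is the matching accessor
    # (.ix, used in the question, has been removed from current pandas).
    train_df = df_cars.iloc[train_index].assign(set="T", iteration=fold)
    validate_df = df_cars.iloc[validate_index].assign(set="V", iteration=fold)
    parts.extend([validate_df, train_df])

# Concatenate once, after the loop, so no fold is overwritten.
df_folds = pd.concat(parts)
df_folds.to_csv("auto-mpg-kfold5.csv")  # hypothetical output path
```

Because every fold contributes both its training and validation rows, the result has 5 times as many rows as df_cars, and the iteration column runs from 1 to 5.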
Upvotes: 1