kuatroka

Reputation: 562

sklearn KFold() - save all the folds to a CSV file

I'm following Jeff Heaton's Deep Learning course (Course link) and going through the exercises as well. In the 1st Programming Assignment, question 5, the last task asks to write the output of the KFold(5) cross-validated input data to a CSV file. Basically, after KFold(5) we will have 5 sets of data (train/test). What I need is to glue them together and save them to a file. I tried to do it, but my data gets overwritten and I only get the last fold's data in my output. I'm starting out with Python and I guess I can't see how to make the loop do the pandas append correctly. Link to the exercises: Exercise 5

Part of my code: df_cars is the original pandas DataFrame. df_cars1 is a copy of df_cars that I make empty so that I can append data to it later.

kf = KFold(5)
#     df_cars.insert(0,'set', 'str')
df_cars.insert(1,'iteration', 0)
df_cars1 = pd.DataFrame(data=None, columns=df_cars.columns,index=df_cars.index)
df_cars1.dropna()

fold = 1
for train_index, validate_index in kf.split(df_cars):        
    trainDF = pd.DataFrame(df_cars.ix[train_index])
    validateDF = pd.DataFrame(df_cars.ix[validate_index])
    trainDF[['set', 'iteration']] = 'T', fold
    validateDF[['set', 'iteration']] = 'V', fold
    print("Fold #{}, Training Size: {}, Validation Size: {}".format(fold,len(trainDF),len(validateDF)))
    fold+=1
    df_cars1 = pd.concat([validateDF,trainDF])

df_cars1.to_csv("./data/auto-mpg-kfold5.csv")
print(df_cars1)

Sample of my output is:

    mpg iteration   set cylinders   displacement    horsepower  weight  acceleration    year    origin  name
319 0.997344    5   V   4   -0.705077   -0.767632   -0.506545   0.701436    80  3   mazda 626
320 1.727537    5   V   4   -0.714680   -0.322309   -0.634239   -0.206262   80  3   datsun 510 hatchback
321 1.112638    5   V   4   -0.820308   -0.767632   -0.834055   -0.133646   80  3   toyota corolla
322 2.957335    5   V   4   -1.031565   -1.029586   -1.017318   0.846667    80  3   mazda glc

As you can see, every value in the iteration column is 5, meaning that only the last (5th) fold ends up in df_cars1, and I need all five folds there. Any help would be appreciated.

Upvotes: 2

Views: 3352

Answers (1)

Vivek Kumar

Reputation: 36599

The problem is in the last line of your for loop:

...
df_cars1 = pd.concat([validateDF,trainDF])

This reassigns the df_cars1 variable to only the current fold's training and validation data, so the data from the earlier iterations is lost.

Change it to:

df_cars1 = pd.concat([df_cars1, validateDF,trainDF])

This way each fold is combined with the previous ones instead of replacing them. Hope it helps.
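As a minimal sketch of the same idea (not the course's reference solution), the folds can also be collected in a list and concatenated once after the loop. The small df_cars frame below is a made-up stand-in for the real data, and the names folds and df_all are hypothetical. Starting from an empty list also avoids the placeholder rows in the question's df_cars1: that frame is built with index=df_cars.index, so it is filled with NaN rows, and df_cars1.dropna() returns a new frame rather than changing df_cars1 in place.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for df_cars: 10 rows, two columns.
df_cars = pd.DataFrame({"mpg": range(10), "cylinders": [4, 6] * 5})

kf = KFold(5)
folds = []  # collects one labelled frame per train/validation part

for fold, (train_index, validate_index) in enumerate(kf.split(df_cars), start=1):
    # kf.split yields positional indices, so select rows with .iloc.
    train_df = df_cars.iloc[train_index].assign(set="T", iteration=fold)
    validate_df = df_cars.iloc[validate_index].assign(set="V", iteration=fold)
    folds.extend([validate_df, train_df])

# A single concat after the loop keeps every fold.
df_all = pd.concat(folds)
```

Calling df_all.to_csv(...) with the path from the question would then write all five folds to the file. Each original row appears five times in df_all: once as validation ('V') and four times as training ('T').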

Upvotes: 1
