ibra

Reputation: 1304

How to split datatable dataframe into train and test dataset in python

I am using a datatable dataframe. How can I split it into train and test datasets?
As with a pandas dataframe, I tried train_test_split(dt_df, classe) from sklearn.model_selection, but it fails with an error.

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split

dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"]
del dt_df[:, "classe"]

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>

As a workaround, I tried converting the dataframe to a numpy array:

classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()

This works, but I don't know whether there is a way to make train_test_split work directly on a datatable dataframe, as it does on a pandas dataframe.
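One way to avoid the conversion is to split row numbers rather than the frame itself, and then select rows from the datatable frame with each list. The sketch below uses only the standard library; split_row_indices is a hypothetical helper (not part of datatable or scikit-learn), and it does not stratify by class.

```python
import random


def split_row_indices(n_rows, test_size=0.2, seed=None):
    """Return (train_idx, test_idx): two disjoint, sorted lists of row numbers."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    indices = list(range(n_rows))
    # Seeded shuffle that does not touch the global random state.
    random.Random(seed).shuffle(indices)
    # Keep at least one row in the test set.
    n_test = max(1, round(n_rows * test_size))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = split_row_indices(10, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

With the frame from the question, something like X_train = dt_df[train_idx, :] and y_train = classe[train_idx, :] should then keep everything in datatable. This assumes classe is still a datatable frame and that a list of integers is accepted as a row selector, which is how I read the datatable indexing documentation.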

Edit 1: The CSV file has string column names, and the values are unsigned integers. print(dt_df) gives:

     | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …  
---- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---     
   0 |   0    0    0    0    2    0    1    0    0    1  …  
   1 |   0    0    0    0    1    0    2    1    0    1  …  
   2 |   0    0    0    1    1    0    1    0    1    2  …  
   3 |   0    0    0    1    1    0    1    0    1    2  …  
   4 |   0    0    0    1    1    0    1    0    1    2  …  
   5 |   0    0    0    1    1    0    1    0    1    2  …  
   6 |   0    0    0    1    0    0    3    0    0    2  …  
   7 |   0    0    0    1    1    0    0    0    1    2  …  
   8 |   0    0    0    1    1    0    1    0    1    2  …  
   9 |   0    0    1    0    1    0    1    0    1    3  …  
  10 |   0    0    1    0    1    0    1    0    1    3  …  
      ...

Thanks for your help.

Upvotes: 5

Views: 17135

Answers (3)

Bob_D_Builder

Reputation: 111

Here is a simple function I made using only pandas. DataFrame.sample selects rows (axis=0) uniformly at random for the test set. The training set is then obtained by dropping from the original dataframe the rows whose index appears in the test set.

def train_test_split(df, frac=0.2):
    
    # get random sample 
    test = df.sample(frac=frac, axis=0)

    # get everything but the test sample
    train = df.drop(index=test.index)

    return train, test

Upvotes: 8

ibra

Reputation: 1304

To split a datatable dataframe with train_test_split(dt_df, classe) from sklearn.model_selection, I convert it first: either to a numpy array, as mentioned in my question, or to a pandas dataframe, as suggested by @Manoor Hassan (optionally converting back to datatable afterwards):

source code before split method:

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

dt_df = dt.fread(csv_file_path)

classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"]

source code after split method:

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)

method 1: convert to numpy

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

method 2: convert to numpy, then convert back to a datatable dataframe after the split:

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# only X_train is converted back; X_test stays a numpy array
X_train = dt.Frame(X_train)

# source code after split method

method 3: convert to pandas dataframe

# source code before split method

dt_df = dt_df.to_pandas()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

All three methods work, but the training time (ExTrCl.fit) and the prediction time (ExTrCl.predict) differ. For a CSV file of about 500 MB I got these results:

Method                  T.convert   T.train   T.pred
M1 to_numpy             3           85        0.5
M2 to_numpy and back    3.5         29        0.5
M3 to pandas            4           37        4
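The answer does not say how these times were taken. One possible way to collect per-stage timings is a small context manager around each step; the sketch below uses only the standard library, timed is a hypothetical helper, and the two timed bodies are stand-ins for the real conversion and fit calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("convert", timings):
    data = [[i, i + 1] for i in range(1000)]  # stand-in for dt_df.to_numpy()
with timed("train", timings):
    total = sum(row[0] for row in data)  # stand-in for ExTrCl.fit(X_train, y_train)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Wrapping the conversion, ExTrCl.fit and ExTrCl.predict calls the same way would give one number per column of the table.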


Upvotes: 0

Manoor Hassan

Reputation: 31

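I don't know of a function that splits a datatable dataframe directly, but you can use pandas to read and split the data:

import pandas as pd
from sklearn.model_selection import train_test_split

dt_df = pd.read_csv(csv_file_path)
classe = dt_df["classe"]
del dt_df["classe"]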

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

and then convert the pandas DataFrames to datatable Frames with:

import datatable as dt

X_train = dt.Frame(X_train)
X_test = dt.Frame(X_test)

Upvotes: 3
