Reputation: 1304
I am using a datatable Frame. How can I split it into train and test sets?
As with a pandas DataFrame, I tried to use train_test_split(dt_df, classe)
from sklearn.model_selection, but it doesn't work and I get an error.
import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"]
del dt_df[:, "classe"]
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
I get the following error: TypeError: Column selector must be an integer or a string, not <class 'numpy.ndarray'>
I tried to work around this by converting the frame to a numpy array:
classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()
That way it works, but I don't know if there is a way to make train_test_split
work directly, as it does with a pandas DataFrame.
Edit 1: The csv file's column names are strings and its values are unsigned ints. print(dt_df)
gives:
     |  CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …
---- +  ---  ---  ---  ---  ---  ---  ---  ---  ---  ---
  0  |    0    0    0    0    2    0    1    0    0    1  …
  1  |    0    0    0    0    1    0    2    1    0    1  …
  2  |    0    0    0    1    1    0    1    0    1    2  …
  3  |    0    0    0    1    1    0    1    0    1    2  …
  4  |    0    0    0    1    1    0    1    0    1    2  …
  5  |    0    0    0    1    1    0    1    0    1    2  …
  6  |    0    0    0    1    0    0    3    0    0    2  …
  7  |    0    0    0    1    1    0    0    0    1    2  …
  8  |    0    0    0    1    1    0    1    0    1    2  …
  9  |    0    0    1    0    1    0    1    0    1    3  …
 10  |    0    0    1    0    1    0    1    0    1    3  …
...
Thanks for your help.
Upvotes: 5
Views: 17135
Reputation: 111
Here is a simple function I made using only pandas. The sample function randomly and uniformly selects rows (axis=0) in the dataframe for the test set. The rows for the training set can be selected by dropping the rows in the original dataframe with the same indexes as the test set.
def train_test_split(df, frac=0.2):
    # hold out a uniformly random sample of rows for the test set
    test = df.sample(frac=frac, axis=0)
    # the training set is everything except the sampled rows
    train = df.drop(index=test.index)
    return train, test
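For example, on a toy DataFrame (the data here is made up, just to show the call):

```python
import pandas as pd

def train_test_split(df, frac=0.2):
    # hold out a uniformly random sample of rows for the test set
    test = df.sample(frac=frac, axis=0)
    # the training set is everything except the sampled rows
    train = df.drop(index=test.index)
    return train, test

# toy dataframe with 10 rows (illustrative, not the question's data)
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
train, test = train_test_split(df, frac=0.2)
print(len(train), len(test))  # 8 2
```

Since the test rows are selected by index and then dropped from the original, the two sets are guaranteed to be disjoint.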
Upvotes: 8
Reputation: 1304
The solution I use to split a datatable Frame into train and test sets with train_test_split
from sklearn.model_selection is to convert the Frame to a numpy array, as I mentioned in my question, or to a pandas DataFrame, as suggested in a comment by @Manoor Hassan (converting to pandas and back again):
source code before split method:
import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
dt_df = dt.fread(csv_file_path)
classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"]
source code after split method:
ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)
method 1: convert to numpy
# source code before split method
dt_df = dt_df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
# source code after split method
method 2: convert to numpy, then convert back to a datatable Frame after the split:
# source code before split method
dt_df = dt_df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
X_train = dt.Frame(X_train)
# source code after split method
method 3: convert to pandas dataframe
# source code before split method
dt_df = dt_df.to_pandas()
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
# source code after split method
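Method 3 end to end, as a self-contained sketch (the data here is synthetic, standing in for the real csv; test_size=0.2 is assumed):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

# synthetic stand-in for the csv: 100 rows of small unsigned ints
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 4, size=(100, 4)),
                  columns=["CCC", "CCG", "CCU", "CCA"])
classe = rng.integers(0, 2, size=100)  # hypothetical binary labels

# pandas DataFrames pass straight through train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, classe, test_size=0.2)

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)
print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```

The split pieces stay pandas DataFrames, so column names survive, which is the advantage over the numpy route.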
All three methods work, but there is a difference in training (ExTrCl.fit) and prediction (ExTrCl.predict) time. For a csv file of about 500 MB I get these results:
                        T convert   T.train   T.pred
M1: to_numpy                    3        85      0.5
M2: to_numpy and back         3.5        29      0.5
M3: to pandas                   4        37        4
Upvotes: 0
Reputation: 31
I don't know of a function that splits a datatable Frame directly, but you can read the csv with pandas instead:
dt_df = pd.read_csv(csv_file_path)
classe = dt_df["classe"]
del dt_df["classe"]
X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)
and then convert the resulting pandas DataFrames to datatable Frames with:
X_train = dt.Frame(X_train)
X_test = dt.Frame(X_test)
Upvotes: 3