Midi

Reputation: 153

How can I split a Dataset from a .csv file for Training and Testing?

I'm using Python and need to split data imported from a .csv file into two parts, a training set and a test set, e.g. 70% training and 30% test.

I keep getting various errors, such as 'list' object is not callable and so on.

Is there any easy way of doing this?

Thanks

EDIT:

The code is basic, I'm just looking to split the dataset.

from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f)) #Imports the CSV
    data[0:1] ( data )

TypeError: 'list' object is not callable

Upvotes: 13

Views: 50616

Answers (5)

sstrcom

Reputation: 1

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

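df = pd.read_csv("in.csv")
# Split row positions rather than the frame itself: 70% train, 30% test
indices = np.arange(len(df))
indices_train, indices_test = train_test_split(indices, test_size=0.3)
# Select the rows for each split by position
df_train = df.iloc[indices_train]
df_test = df.iloc[indices_test]
# Write each split to its own file
df_train.to_csv("train.csv")
df_test.to_csv("test.csv")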

Upvotes: -1

rounak

Reputation: 79

You should use sklearn.model_selection.train_test_split, which is designed for splitting a dataset. The code below shows how to use it:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('C:/Dataset.csv')
x_train, x_test, y_train, y_test = train_test_split(
    data["qus"], data["ans"], test_size=0.3)

# Recombine questions and answers for each split
train_data = pd.concat([x_train, y_train], axis=1)
test_data = pd.concat([x_test, y_test], axis=1)
train_data.head()

This assumes that your CSV contains two columns: "qus" for the question and "ans" for the answer.

Upvotes: 0

dr_dronych

Reputation: 91

You should use the read_csv() function from the pandas module. It reads all your data straight into a DataFrame, which you can then break into train and test sets. To do the split itself, you can use the train_test_split() function from the scikit-learn module.
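
A minimal sketch of those two steps together. The inline CSV text, its column names, and the fixed random_state are assumptions made only so the example runs on its own; with a real file you would pass its path to read_csv() instead.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real file; in practice use
# pd.read_csv('C:/Dataset.csv') instead of the in-memory text.
csv_text = "feature,label\n" + "\n".join(f"{i},{i % 2}" for i in range(10))
df = pd.read_csv(io.StringIO(csv_text))

# Shuffle the rows and hold out 30% of them for testing.
# random_state is fixed here only to make the split repeatable.
train, test = train_test_split(df, test_size=0.3, random_state=42)

print(len(train), len(test))
```

Passing the whole DataFrame keeps every column in both parts, so this works even when the data has no separate label column.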

Upvotes: 5

Flair

Reputation: 2907

A better practice is to use df.sample, which draws a fixed fraction of the rows at random:

from numpy.random import RandomState
import pandas as pd

df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()

train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
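
Note that an unseeded RandomState() gives a different split on every run. If you need a repeatable split, a rough sketch is to pass an integer seed instead. The seed value 42 and the small in-memory frame below are arbitrary assumptions standing in for the real CSV.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for pd.read_csv('C:/Dataset.csv')
df = pd.DataFrame({"feature": range(10), "label": [i % 2 for i in range(10)]})

# An integer seed makes df.sample return the same rows on every run
train = df.sample(frac=0.7, random_state=42)
# Everything not drawn into train becomes the test set
test = df.drop(train.index)

# Sampling again with the same seed reproduces the same training rows
train_again = df.sample(frac=0.7, random_state=42)
```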

Upvotes: 10

zipa

Reputation: 27869

You can use pandas:

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Dataset.csv')
# Boolean mask: each row goes to train with probability 0.7,
# so the split is approximately (not exactly) 70/30
msk = np.random.rand(len(df)) <= 0.7

train = df[msk]
test = df[~msk]

Upvotes: 32
