Reputation: 153
I'm using Python and need to split my imported .csv data into two parts, a training set and a test set, e.g. 70% training and 30% test.
I keep getting various errors, such as 'list' object is not callable, and so on.
Is there any easy way of doing this?
Thanks
EDIT:
The code is basic, I'm just looking to split the dataset.
from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f))  # Imports the CSV
data[0:1] ( data )
TypeError: 'list' object is not callable
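The TypeError comes from the line `data[0:1] ( data )`: `data[0:1]` is a list slice, and the parentheses after it try to call that list as if it were a function. Slicing on its own is enough to split the rows. Below is a minimal standard-library sketch of a 70/30 split, assuming the rows fit in memory. The names `split_rows`, `train_frac` and `seed` are illustrative and not part of the original code, and the generated rows stand in for the CSV contents.

```python
import random

def split_rows(rows, train_frac=0.7, seed=None):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)                    # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle for repeatable splits
    cut = round(len(shuffled) * train_frac)  # index where the test part begins
    return shuffled[:cut], shuffled[cut:]

# Stand-in for `data = list(reader(f))`; real rows would come from the CSV.
data = [[str(i), str(i * i)] for i in range(10)]
train, test = split_rows(data, train_frac=0.7, seed=42)
print(len(train), len(test))
```

If the file has a header row, it would need to be removed from `data` before shuffling so that it does not end up in either set.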
Upvotes: 13
Views: 50616
Reputation: 1
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("in.csv")
indices = np.arange(len(df))
indices_train, indices_test = train_test_split(indices, test_size = 0.3)
df_train = df.iloc[indices_train]
df_test = df.iloc[indices_test]
df_train.to_csv("train.csv")
df_test.to_csv("test.csv")
Upvotes: -1
Reputation: 79
You should use sklearn.model_selection.train_test_split, as it is well suited to splitting a dataset. The code below shows how to use it:
`
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('C:/Dataset.csv')
x_train, x_test, y_train, y_test = train_test_split(
    data["qus"], data["ans"], test_size=0.3)
# Recombine features and labels for each part of the split
train_data = pd.concat([x_train, y_train], axis=1)
test_data = pd.concat([x_test, y_test], axis=1)
train_data.head()
`
This assumes that your CSV contains two columns, one for the question ("qus") and one for the answer ("ans").
Upvotes: 0
Reputation: 91
You should use the read_csv() function from the pandas module. It reads all your data straight into a DataFrame, which you can then split into training and test sets. You can also use the train_test_split() function from the scikit-learn module for the split itself.
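As a rough sketch of how the two functions fit together: the example below reads CSV text into a DataFrame and passes the whole frame to train_test_split. The inline CSV text and the column names are placeholders standing in for a real file, and the `random_state` value is an arbitrary choice made only so the split is repeatable.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the CSV file; with a real file, pass its path to read_csv.
csv_text = "qus,ans\n" + "\n".join(f"q{i},a{i}" for i in range(10))
df = pd.read_csv(io.StringIO(csv_text))

# test_size=0.3 holds out 30% of the rows; random_state makes the split repeatable.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

print(train_df.shape, test_df.shape)
```

Passing the DataFrame itself keeps all columns together, so no separate recombination step is needed afterwards.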
Upvotes: 5
Reputation: 2907
A better practice is to use df.sample, which draws an exact fraction of the rows at random:
from numpy.random import RandomState
import pandas as pd
df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()
train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
Upvotes: 10
Reputation: 27869
You can use pandas:
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Dataset.csv')
# Boolean mask: each row independently has a ~70% chance of going to the training set
msk = np.random.rand(len(df)) <= 0.7
train = df[msk]
test = df[~msk]
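Note that a random mask assigns each row independently, so the result is only approximately 70/30 and the exact sizes vary from run to run. If exact sizes matter, one possible variation is to shuffle the row positions and cut them at a fixed point. This is a sketch under assumptions: the small DataFrame stands in for the data loaded from the CSV, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; a real run would use pd.read_csv(...) instead.
df = pd.DataFrame({"x": range(10)})

rng = np.random.default_rng(0)     # seeded generator, for repeatable splits
order = rng.permutation(len(df))   # shuffled row positions 0..n-1
cut = round(len(df) * 0.7)         # number of training rows

train = df.iloc[order[:cut]]       # first 70% of the shuffled positions
test = df.iloc[order[cut:]]        # remaining 30%
print(len(train), len(test))
```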
Upvotes: 32