Reputation: 14689
I am trying to split a dataset into a training set and a test set. I wrote the following script to do the split:
import pandas as pd
import numpy as np
data_path = "/path_to_data/"
df = pd.read_csv(data_path+"product.dlm", header=0, delimiter="|")
ts = df.shape
# print("data dimension", ts)
# print("product attributes \n", df.columns.values)
# Shuffle the data set, then split it into train and test sets.
new_train = df.reindex(np.random.permutation(df.index))
indice_90_percent = int((ts[0]/100.0)* 90)
new_train[:indice_90_percent].to_csv('train_products.txt',header=True, sep="|")
new_train[indice_90_percent:].to_csv('test_products.txt',header=True, sep="|")
The original file looks like this:
label1|label2|...|labeln
371658|description|...|"some value"
The file generated by to_csv() has one extra, unnamed column at the beginning, which looks like this:
|label1|label2|...|labeln|
452488|422932|description|...|"some value"|
What am I missing?
Upvotes: 2
Views: 1453
Reputation: 14689
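Adding index=False solved the problem. By default, to_csv() writes the DataFrame's index as the first column, and because the index has no name, its header cell is empty: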
new_train[indice_90_percent:].to_csv('test_products.txt',header=True, sep="|", index=False)
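For anyone who wants to see the difference without the original file, here is a minimal sketch. The inline sample data and the variable names (raw, shuffled, with_index, without_index) are made up for illustration and are not from the original product.dlm. Calling to_csv() without a path returns the CSV text as a string, which makes the two outputs easy to compare:

```python
import io

import numpy as np
import pandas as pd

# Small stand-in for product.dlm (hypothetical data, not the original file).
raw = "label1|label2|label3\n1|a|x\n2|b|y\n3|c|z\n4|d|w\n"
df = pd.read_csv(io.StringIO(raw), sep="|")

# Shuffle the rows the same way the question does.
shuffled = df.reindex(np.random.permutation(df.index))

# Default behaviour: the index is written as an unnamed first column.
with_index = shuffled.to_csv(sep="|")

# index=False: only the data columns are written.
without_index = shuffled.to_csv(sep="|", index=False)

print(with_index.splitlines()[0])     # header line begins with an empty field
print(without_index.splitlines()[0])  # header line matches the original file
```

The same index=False argument would also be needed on the train_products.txt call, otherwise that file still gets the extra column.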
Upvotes: 4