Reputation: 940
My train set has 307,511 rows and my test set has 48,744. I combined them into one dataframe (named 'data'), which has 356,255 rows.
I created a series that flags whether each row belongs to the train set or the test set.
trainlen = pd.Series([1]*len(train)+[0]*len(test))
Its length is 356255, as expected.
When I add it to the dataset I get strange behavior:
data = pd.concat([train,test])
data['isTrain'] = trainlen
While trainlen.sum() returns 307,511 (as it should), data.isTrain.sum() returns 356,255.
Only when I use 'values':
data['isTrain'] = trainlen.values
does data.isTrain.sum() return 307,511.
Can you explain why this is happening?
Upvotes: 0
Views: 58
Reputation: 4629
The problem is the index. When you combine the two dataframes with pd.concat, their indexes are concatenated as well, so your df's index looks something like this:
[0, 1, 2, ..., 307510, 0, 1, 2, 3, ..., 48743]
As you can see, the index starts from 0 again partway through. Your series has an index too (0 through 356254), so when you do the assignment like this:
data['isTrain'] = trainlen  # assigning a Series object, which carries its own index
pandas aligns on index labels rather than on position: each row of the dataframe takes the series value stored under the same label. No label in your dataframe is larger than 307510, and the series holds 1 for every label below 307511, so the new column contains only 1s. (The test rows have labels [0, 1, 2, 3, ..., 48743], and the series values under those labels are all 1s.)
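A toy reproduction of this alignment may make it easier to see. This is only a sketch: the two tiny frames and the name flag are hypothetical stand-ins for the real train, test and trainlen.

```python
import pandas as pd

# Hypothetical toy frames standing in for train (3 rows) and test (2 rows).
train = pd.DataFrame({"inst": [10, 11, 12]})
test = pd.DataFrame({"inst": [20, 21]})

data = pd.concat([train, test])  # index labels are now 0, 1, 2, 0, 1
flag = pd.Series([1] * len(train) + [0] * len(test))  # index labels 0, 1, 2, 3, 4

# Assignment aligns on labels: each row of data looks up its own label in flag.
data["isTrain"] = flag

index_labels = data.index.tolist()
aligned = data["isTrain"].tolist()
print(index_labels)  # [0, 1, 2, 0, 1]
print(aligned)       # [1, 1, 1, 1, 1]
```

The labels 3 and 4 of flag, which hold the 0s, are never looked up because no row of data carries those labels.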
Your df will look something like this:
        inst  isTrain
0          0        1
1          1        1
2          2        1
3          3        1
4          4        1
...      ...      ...
48739  48739        1
48740  48740        1
48741  48741        1
48742  48742        1
48743  48743        1
Notice that the last index label is 48743, not 356254, because the labels repeat. If you reset the index of your df before the assignment, it works:
data = data.reset_index(drop=True)  # replace the duplicated labels with 0..len(data)-1
data['isTrain'] = trainlen
print(trainlen.sum())       # 307511
print(data.isTrain.sum())   # 307511
Now the index and the values are correct:
         inst  isTrain
0           0        1
1           1        1
2           2        1
3           3        1
4           4        1
...       ...      ...
356250  48739        0
356251  48740        0
356252  48741        0
356253  48742        0
356254  48743        0
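As a possible alternative, the duplicated labels can be avoided at concatenation time by passing ignore_index=True to pd.concat. A minimal sketch, again using hypothetical toy frames and a hypothetical flag series rather than the real data:

```python
import pandas as pd

# Hypothetical toy frames standing in for train (3 rows) and test (2 rows).
train = pd.DataFrame({"inst": [10, 11, 12]})
test = pd.DataFrame({"inst": [20, 21]})
flag = pd.Series([1] * len(train) + [0] * len(test))

# Option 1: let concat build a fresh 0..n-1 index.
data = pd.concat([train, test], ignore_index=True)
data["isTrain"] = flag

# Option 2: keep the plain concat and reset the index afterwards.
data2 = pd.concat([train, test]).reset_index(drop=True)
data2["isTrain"] = flag

total = int(data["isTrain"].sum())
total2 = int(data2["isTrain"].sum())
print(total, total2)  # 3 3
```

Both options give the dataframe the same labels as the series, so label alignment and positional order agree.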
When you use trainlen.values instead, you assign a plain NumPy array, which has no index. The values are placed by position, so the assignment is safe even with duplicated labels in the df.
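For illustration, here is a sketch of the positional route on the same kind of hypothetical toy frames. It uses Series.to_numpy(), which returns the same kind of array as .values, and also shows a variant (my assumption of what may be convenient, not something from the question) that sets the flag on each frame before concatenating, so no alignment is involved at all.

```python
import pandas as pd

# Hypothetical toy frames standing in for train (3 rows) and test (2 rows).
train = pd.DataFrame({"inst": [10, 11, 12]})
test = pd.DataFrame({"inst": [20, 21]})
flag = pd.Series([1] * len(train) + [0] * len(test))

data = pd.concat([train, test])      # duplicated labels kept on purpose
data["isTrain"] = flag.to_numpy()    # plain array: assigned by position

positional = data["isTrain"].tolist()
print(positional)  # [1, 1, 1, 0, 0]

# Variant: flag each frame first, then concatenate.
data2 = pd.concat([train.assign(isTrain=1), test.assign(isTrain=0)])
flagged_first = data2["isTrain"].tolist()
print(flagged_first)  # [1, 1, 1, 0, 0]
```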
Upvotes: 1