Niv

Reputation: 940

Adding a Series to DataFrame results in strange behavior

My train set has 307,511 rows and my test set has 48,744. I combined them into one dataframe (named 'data'), which has 356,255 rows.

I created a series that flags whether an item belongs to train or test set.

trainlen = pd.Series([1]*len(train)+[0]*len(test))

Its length is 356,255, as expected.

When I add it to the dataset I get strange behavior:

data = pd.concat([train,test])
data['isTrain'] = trainlen

While trainlen.sum() returns 307,511 (as it should), data.isTrain.sum() returns 356,255.

It's only when I use .values:

data['isTrain'] = trainlen.values

that data.isTrain.sum() returns 307,511.

Can you explain why this is happening?

Upvotes: 0

Views: 58

Answers (1)

Nikaido

Reputation: 4629

The problem is with the indexes. By default, concat keeps the original index of each dataframe, so the index of your combined df looks something like this:

[0, 1, 2, ..., 307510, 0, 1, 2, 3, ... 48743]

As you can see, the index starts from 0 again where the test rows begin. Your series has an index of its own (0 to 356254), so when you do the assignment like this:

data['isTrain'] = trainlen # I am doing an assignment with a Series object that contains also indexes!

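pandas aligns the series with the dataframe by index label, not by position. Each row receives the series value stored under that row's own label. The train rows (labels 0 to 307510) get '1's as intended, but the test rows are labelled 0 to 48743, and the series values under those labels are also '1's. The series labels 307511 to 356254, which hold the '0's, never appear in the dataframe's index, so they are dropped. That is why every row ends up as 1.

Your df will look something like this: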

        inst  isTrain
0          0        1
1          1        1
2          2        1
3          3        1
4          4        1
...      ...      ...
48739  48739        1
48740  48740        1
48741  48741        1
48742  48742        1
48743  48743        1
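To make the alignment visible, here is a minimal sketch on a tiny dataset. The 3-row train frame and 2-row test frame are made-up stand-ins for the real data, and the variable name flag is only illustrative; the behavior should be the same at any size.

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test frames (3 and 2 rows).
train = pd.DataFrame({"inst": [0, 1, 2]})
test = pd.DataFrame({"inst": [0, 1]})

data = pd.concat([train, test])  # index labels: 0, 1, 2, 0, 1
flag = pd.Series([1] * len(train) + [0] * len(test))  # index labels: 0..4

# Series assignment is aligned on index labels, not on position.
data["isTrain"] = flag

index_labels = data.index.tolist()
aligned_flags = data["isTrain"].tolist()
print(index_labels)   # the labels restart at 0 for the test rows
print(aligned_flags)  # every row picked up the value under its own label
print(flag.sum(), data["isTrain"].sum())
```

The two test rows carry the labels 0 and 1, so they receive flag[0] and flag[1], which are both 1. The zeros stored under labels 3 and 4 are never looked up.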

Notice that the index labels repeat: the test rows are labelled 0 to 48743 instead of 307511 to 356254. If you reset the index of your df before the assignment, it works:

data = data.reset_index(drop=True) # replace the duplicated labels with 0 .. len(data)-1
data['isTrain'] = trainlen

print(trainlen.sum())
print(data.isTrain.sum())

Now the indexes and the values are correct:

         inst  isTrain
0           0        1
1           1        1
2           2        1
3           3        1
4           4        1
...       ...      ...
356250  48739        0
356251  48740        0
356252  48741        0
356253  48742        0
356254  48743        0

When you use trainlen.values instead, you pass a plain NumPy array, which has no index. pandas then assigns the values by position, so no alignment takes place and the assignment is safe.
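As a rough sketch of two ways to avoid the problem, the example below again uses small made-up frames in place of the real train and test sets. The first option asks concat for a fresh index with ignore_index=True; the second keeps the duplicated labels and assigns by position using to_numpy(), which serves the same purpose as .values here.

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test frames (3 and 2 rows).
train = pd.DataFrame({"inst": [0, 1, 2]})
test = pd.DataFrame({"inst": [0, 1]})
flag = pd.Series([1] * len(train) + [0] * len(test))

# Option 1: build a fresh 0..n-1 index while concatenating,
# so the labels of data and flag match one to one.
data_fresh = pd.concat([train, test], ignore_index=True)
data_fresh["isTrain"] = flag
flags_fresh_index = data_fresh["isTrain"].tolist()

# Option 2: keep the duplicated labels, but assign a NumPy array,
# which has no index and is therefore written by position.
data_dup = pd.concat([train, test])
data_dup["isTrain"] = flag.to_numpy()
flags_by_position = data_dup["isTrain"].tolist()

print(flags_fresh_index)
print(flags_by_position)
```

Option 1 is usually the cleaner choice when the original row labels carry no meaning, because later label-based operations on the combined frame will not run into the duplicates either.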

Upvotes: 1
