AmyChodorowski

Reputation: 402

Interesting results with duplicate columns in pandas.DataFrame

Can anyone help explain why some operations raise errors and others do not when a pandas.DataFrame has a duplicate column?

Minimal, Reproducible Example

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'b'])

If I try to insert a list into column 'a', I get an error about a dimension mismatch:

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: cannot copy sequence with size 5 to array axis with dimension 0

Similarly for 'b':

df.loc[:, 'b'] = list(range(5))

Traceback (most recent call last):
...
ValueError: could not broadcast input array from shape (5) into shape (0,2)

However, inserting into an entirely new column works without error, although inserting into 'a' or 'b' afterwards still fails:

df.loc[:, 'c'] = list(range(5))
print(df)

     a    b    b  c
0  NaN  NaN  NaN  0
1  NaN  NaN  NaN  1
2  NaN  NaN  NaN  2
3  NaN  NaN  NaN  3
4  NaN  NaN  NaN  4

df.loc[:, 'a'] = list(range(5))

Traceback (most recent call last):
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

All of these errors disappear if I remove the duplicate column 'b'.
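For reference, here is a minimal sketch of one way to detect and drop the repeated label. It relies on Index.duplicated(), and the names dup_mask and deduped are only illustrative. It assumes that keeping the first occurrence of each label is acceptable, which may not be true for real data.

```python
import pandas as pd

# Same frame as in the example above: the label 'b' appears twice.
df = pd.DataFrame(columns=['a', 'b', 'b'])

# False when any column label is repeated.
print(df.columns.is_unique)

# Index.duplicated() flags every repeat of a label after its first occurrence.
dup_mask = df.columns.duplicated()
print(list(df.columns[dup_mask]))

# Keep only the first occurrence of each label (illustrative choice).
deduped = df.loc[:, ~dup_mask].copy()
print(list(deduped.columns))
```

Checking df.columns.is_unique before assigning is a cheap way to tell whether this situation applies at all.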


Additional information

pandas==1.0.2

Upvotes: 8

Views: 576

Answers (1)

Janneman

Reputation: 353

Why use loc and not just:

df['a'] = list(range(5))

This gives no error and seems to produce what you need:

   a    b    b
0  0  NaN  NaN
1  1  NaN  NaN
2  2  NaN  NaN
3  3  NaN  NaN
4  4  NaN  NaN

The same works for creating column 'c':

df['c'] = list(range(5))

Upvotes: 1
