Reputation: 402
Can anyone help explain why I get errors in some operations and not others when there is a duplicate column in a pandas.DataFrame?
Minimal, Reproducible Example
import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'b'])
If I try to insert a list into column 'a', I get an error about a dimension mismatch:
df.loc[:, 'a'] = list(range(5))
Traceback (most recent call last):
...
ValueError: cannot copy sequence with size 5 to array axis with dimension 0
Similarly with 'b':
df.loc[:, 'b'] = list(range(5))
Traceback (most recent call last):
...
ValueError: could not broadcast input array from shape (5) into shape (0,2)
However, inserting into an entirely new column gives no error, although inserting into 'a' or 'b' afterwards still fails:
df.loc[:, 'c'] = list(range(5))
print(df)
     a    b    b  c
0  NaN  NaN  NaN  0
1  NaN  NaN  NaN  1
2  NaN  NaN  NaN  2
3  NaN  NaN  NaN  3
4  NaN  NaN  NaN  4
df.loc[:, 'a'] = list(range(5))
Traceback (most recent call last):
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 0)
All of these errors disappear if I remove the duplicate column 'b'.
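For reference, here is a minimal sketch of detecting and dropping the repeated label before assigning. It assumes that keeping only the first 'b' is acceptable, and it uses plain bracket assignment rather than loc, since loc behaviour on an empty frame may differ between pandas versions. The names dup_mask and deduped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'b'])

# Index.duplicated() marks every repeat of a label after its first occurrence.
dup_mask = df.columns.duplicated()
print(df.columns.is_unique)  # False while 'b' appears twice

# Keep only the first occurrence of each label; .copy() gives an independent frame.
deduped = df.loc[:, ~dup_mask].copy()

# Plain bracket assignment on the de-duplicated (still empty) frame.
deduped['a'] = list(range(5))
print(deduped)
```

Checking df.columns.is_unique up front is a cheap way to confirm whether a duplicate label is involved before debugging the assignment itself.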
Additional information
pandas==1.0.2
Upvotes: 8
Views: 576
Reputation: 353
Why use loc and not just:
df['a'] = list(range(5))
This gives no error and seems to produce what you need:
a b b
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
The same works for creating column 'c':
df['c'] = list(range(5))
Upvotes: 1