Reputation: 3848
By default, pandas.read_csv()
mangles duplicate CSV headers into the form X, X.1, ..., X.N.
Setting the flag mangle_dupe_cols = False
raises:
ValueError: Setting mangle_dupe_cols=False is not supported yet
This seems foolish to me: why expose a Boolean flag and then reject one of its two values?
Is there a way, through the pandas API, to detect or remove duplicate headers at DataFrame creation time, i.e. within the read_csv()
call itself?
Possible Solution (non-API):
headers = list(df.columns)  # will include both 'X' and 'X.1'
unique = []
n = 0  # index of the next header that is not a mangled duplicate
for i, name in enumerate(headers):
    if i != n:
        continue  # skip dupes ('X.1', 'X.2', ...)
    d = 1
    while (name + '.' + str(d)) in headers:  # count the mangled copies
        d += 1
        n += 1
    unique.append(name)
    n += 1
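Another way to do this at read time, without needing mangle_dupe_cols=False at all, is to inspect the raw header row with the csv module first and then pass integer positions to usecols, so only the first occurrence of each name is ever loaded. A sketch (the StringIO stands in for a real file):

```python
import csv
import io

import pandas as pd

# Hypothetical data standing in for a CSV file on disk.
raw = "A,A,B\n1,2,3\n4,5,6\n"

# Read only the raw header row, before pandas mangles the names.
original = next(csv.reader(io.StringIO(raw)))   # ['A', 'A', 'B']

# Keep only the position of the first occurrence of each name.
seen = set()
keep = []
for i, name in enumerate(original):
    if name not in seen:
        seen.add(name)
        keep.append(i)

# usecols with integer positions loads just those columns, so no
# duplicate names remain and nothing gets suffixed.
df = pd.read_csv(io.StringIO(raw), usecols=keep)
```

This avoids materializing the duplicate columns entirely, which also saves memory on wide files.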
Upvotes: 5
Views: 4613
Reputation:
Both of the previous answers remove the duplicate columns. That is not what I would want or expect from mangle_dupe_cols=False; rather, I would want to skip the suffixing and keep the duplicate columns. For that you could use:
df.columns = df.columns.str.replace(r"(\.\d+)$", "", regex=True)
after importing the dataframe.
But this only works if you have a single header row consisting of strings; otherwise you would have to do something fancier.
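A minimal sketch of the idea (the toy DataFrame stands in for what read_csv returns after mangling; regex=True is spelled out because the default changed in pandas 2.0):

```python
import pandas as pd

# Toy frame standing in for the result of read_csv on duplicate headers.
df = pd.DataFrame([[1, 2, 3]], columns=["A", "A.1", "B"])

# Strip the ".N" suffix pandas appended; both copies of "A" survive.
df.columns = df.columns.str.replace(r"(\.\d+)$", "", regex=True)
```

Note the caveat: a column that was literally named "A.1" in the source file would also lose its suffix.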
Upvotes: 0
Reputation: 93
Here is my one-liner; it assumes there is no dot character in your original column names.
df = df.loc[:,df.columns.str.find('.') < 0]
Upvotes: 1
Reputation: 353329
I might do something like
In [22]: df = pd.read_csv("dup.col")
In [23]: df
Out[23]:
   A  A.1  B  C  C.1  C.2  D  E  C.3
0  1    2  3  4    5    6  7  8    9
1  1    2  3  4    5    6  7  8    9
In [24]: df.loc[:, ~df.columns.str.replace(r"(\.\d+)$", "", regex=True).duplicated()]
Out[24]:
   A  B  C  D  E
0  1  3  4  7  8
1  1  3  4  7  8
(With a little more uglification we could handle non-string columns as well, but maybe YAGNI.)
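One possible version of that "uglification" is to coerce the labels to strings before applying the .str accessor, so non-string column labels don't break the mask. A sketch (the mixed-type columns below are illustrative):

```python
import pandas as pd

# Column labels need not be strings (e.g. the 0 below), so map them to
# str before using the .str accessor on the Index.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["A", "A.1", 0, "B"])

labels = df.columns.map(str)
# Drop columns whose de-suffixed name duplicates an earlier one.
mask = ~labels.str.replace(r"(\.\d+)$", "", regex=True).duplicated()
df = df.loc[:, mask]
```

The original labels (including the non-string ones) are preserved on the result; only the boolean mask is computed from the stringified copies.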
Upvotes: 5