pstatix
pstatix

Reputation: 3848

Removing duplicate columns from pandas.read_csv()

By default, pandas.read_csv() mangles duplicate CSV headers in the form of X, X.1, ..., X.N.

The flag mangle_dupe_cols = False returns:

ValueError: Setting mangle_dupe_cols=False is not supported yet

This to me seems foolish as to why you would allow a Boolean flag but disable the Boolean-ness...

Is there a way through the pandas API that is best suited for detecting/removing duplicate headers upon DataFrame initialization through the read_csv() method?

Possible Solution (non-API):

headers = set(df.columns) # will include 'X' and 'X.1'
n = 0
unique = []
for i in range(len(headers)):
    d = 1
    if i != n:
        continue # skip dupes
    elif (headers[i] + '.' + str(d)) in headers:
        while True: # find next unique element
            d += 1
            n += 1
            if (headers[i] + '.' + str(d)) not in headers:
                break
    unique.append(headers[i])
    n += 1

Upvotes: 5

Views: 4613

Answers (3)

user9413641
user9413641

Reputation:

Both of the previous answers remove duplicate columns, this is not what I would want/expect from mangle_dupe_cols=False, rather I would want to just avoid the suffix and to keep the duplicate columns, for this you could use:

df.columns = ~df.columns.str.replace("(\.\d+)$", "")

after importing the dataframe.

But this only works if you only have a single header consisting of strings, otherwise you would have to do something more fancy.

Upvotes: 0

hevangel
hevangel

Reputation: 93

Here is my one-liner, assume there is no dot character in your column names.

df = df.loc[:,df.columns.str.find('.') < 0]

Upvotes: 1

DSM
DSM

Reputation: 353329

I might do something like

In [22]: df = pd.read_csv("dup.col")

In [23]: df
Out[23]: 
   A  A.1  B  C  C.1  C.2  D  E  C.3
0  1    2  3  4    5    6  7  8    9
1  1    2  3  4    5    6  7  8    9

In [24]: df.loc[:, ~df.columns.str.replace("(\.\d+)$", "").duplicated()]
Out[24]: 
   A  B  C  D  E
0  1  3  4  7  8
1  1  3  4  7  8

(With a little more uglification we could handle non-string columns as well, but maybe YAGNI.)

Upvotes: 5

Related Questions