Bappa
Bappa

Reputation: 81

Drop duplicate columns based on column names

Is there a way to drop specific duplicate columns just based on their column names. Values don't matter. Following table has columns with names A, B, C, D which are repetitive. I just want to get rid of the occurrences of column named A except the first one.

A   B   C   D   A   B   C   D
18  75  92  24  88  75  49  7
30  56  42  58  60  72  36  47
85  84  28  84  68  49  71  9
71  70  97  99  63  71  1   14
82  63  67  23  74  38  74  37
87  11  47  69  52  24  51  50
46  73  85  55  69  15  3   26
47  100 6   42  26  76  88  54
81  100 79  59  4   19  5   89
93  67  83  51  66  52  52  84
29  26  58  54  31  30  52  17
48  29  7   79  75  90  13  95
77  40  16  30  41  92  98  68
56  11  39  45  65  69  4   75
93  83  98  83  9   21  66  89
2   10  4   29  44  95  53  67

The end result should look something like this

A   B   C   D   B   C   D
18  75  92  24  75  49  7
30  56  42  58  72  36  47
85  84  28  84  49  71  9
71  70  97  99  71  1   14
82  63  67  23  38  74  37
87  11  47  69  24  51  50
46  73  85  55  15  3   26
47  100 6   42  76  88  54
81  100 79  59  19  5   89
93  67  83  51  52  52  84
29  26  58  54  30  52  17
48  29  7   79  90  13  95
77  40  16  30  92  98  68
56  11  39  45  69  4   75
93  83  98  83  21  66  89
2   10  4   29  95  53  67

I have tried using this which deletes all the duplicate columns found and not just A.

df = df.loc[:,~df.columns.duplicated()]

Upvotes: 1

Views: 117

Answers (3)

Andy L.
Andy L.

Reputation: 25269

Use iloc with get_indexer_for

a = df.columns.get_indexer_for(['A'])[1:]
df_final = df.iloc[:, [c for c in range(df.shape[1]) if c not in a]]

Out[1131]:
     A    B   C   D   B   C   D
0   18   75  92  24  75  49   7
1   30   56  42  58  72  36  47
2   85   84  28  84  49  71   9
3   71   70  97  99  71   1  14
4   82   63  67  23  38  74  37
5   87   11  47  69  24  51  50
6   46   73  85  55  15   3  26
7   47  100   6  42  76  88  54
8   81  100  79  59  19   5  89
9   93   67  83  51  52  52  84
10  29   26  58  54  30  52  17
11  48   29   7  79  90  13  95
12  77   40  16  30  92  98  68
13  56   11  39  45  69   4  75
14  93   83  98  83  21  66  89
15   2   10   4  29  95  53  67

Upvotes: 1

Raf
Raf

Reputation: 1767

This should work:

# Mark all columns except `A`s as True
cols = ~(df.columns == 'A')

# Add the first occurrence of `A`:
cols[list(df.columns).index('A')] = True

# Here you get `df` only with first occurrence of `A`:
df.loc[:,cols]

Upvotes: 3

BENY
BENY

Reputation: 323396

Let us try cumcount

s = df.columns.to_series()
out = df.loc[:,~((s.groupby(s).cumcount()==1) & (s =='A'))]
Out[113]: 
     A    B   C   D   B   C   D
0   18   75  92  24  75  49   7
1   30   56  42  58  72  36  47
2   85   84  28  84  49  71   9
3   71   70  97  99  71   1  14
4   82   63  67  23  38  74  37
5   87   11  47  69  24  51  50
6   46   73  85  55  15   3  26
7   47  100   6  42  76  88  54
8   81  100  79  59  19   5  89
9   93   67  83  51  52  52  84
10  29   26  58  54  30  52  17
11  48   29   7  79  90  13  95
12  77   40  16  30  92  98  68
13  56   11  39  45  69   4  75
14  93   83  98  83  21  66  89
15   2   10   4  29  95  53  67

Upvotes: 2

Related Questions