How to remove columns with repeated name but keeping data

Question

I am using pandas dataframe for a dataset where the attributes are English words. After stemming the words, I have multiple columns with the same name. Here are sample data snap, after stemming, the accept, acceptable and accepted become accept. I want to use bitwise_or on all columns with the same name and delete the repeated one. I tried this code

import numpy
from nltk.stem import *
import pandas as pd
ps = PorterStemmer()
dataset = pd.read_csv('sampleData.csv')
stemmed_words = []

for w in list(dataset):
    stemmed_words.append(ps.stem(w))

dataset.columns = stemmed_words
new_word = stemmed_words[0]

for w in stemmed_words:
    if new_word == w:
         numpy.bitwise_or(dataset[new_word], dataset[w])
         del dataset[w]
     else:
         new_word = w

print(dataset)

The problem is that when the for loop execute

del dataset['accept']

It deletes all the column with this name. and I do not know in advance that how many columns will have the same name and this code generate an exception KeyError: 'accept'

I want to apply bitwise_or on all three accept columns, save it into a new column named 'accept' and del the old columns.

I hope that I will not be downvoted this time

Here is sample data:

able  abundance  academy  accept  accept  accept  access  accommodation  accompany Class
   0          0        0       0       0       1       1              0          0     C
   0          0        0       1       0       0       0              0          0     A
   0          0        0       0       1       0       0              0          0     H
   0          0        0       0       0       1       0              1          0     G
   0          0        0       1       0       0       0              0          0     G

The output should be

Class  able  abundance  academy  accept  access  accommodation  accompany
    C     0          0        0       1       1              0          0
    A     0          0        0       1       0              0          0
    H     0          0        0       1       0              0          0
    G     0          0        0       1       0              1          0
    G     0          0        0       1       0              0          0

MaxU - stand with Ukraine · Accepted Answer

IIUC you can group by column names (axis=1).

Data Frame:

In [101]: df
Out[101]:
   able  abundance  academy  accept  accept  accept  access  accommodation  accompany Class
0     0          0        0       0       0       1       1              0          0     C
1     0          0        0       1       0       0       0              0          0     A
2     0          0        0       0       1       0       0              0          0     H
3     0          0        0       0       0       1       0              1          0     G
4     0          0        0       1       0       0       0              0          0     G

Solution:

In [103]: df.pop('Class').to_frame() \
     ...:   .join(df.groupby(df.columns, axis=1).any(1).mul(1))
Out[103]:
  Class  able  abundance  academy  accept  access  accommodation  accompany
0     C     0          0        0       1       1              0          0
1     A     0          0        0       1       0              0          0
2     H     0          0        0       1       0              0          0
3     G     0          0        0       1       0              1          0
4     G     0          0        0       1       0              0          0

or even better solution (@ayhan, thank you for the hint!):

In [114]: df = df.pop('Class').to_frame().join(df.groupby(df.columns, axis=1).max())

In [115]: df
Out[115]:
  Class  able  abundance  academy  accept  access  accommodation  accompany
0     C     0          0        0       1       1              0          0
1     A     0          0        0       1       0              0          0
2     H     0          0        0       1       0              0          0
3     G     0          0        0       1       0              1          0
4     G     0          0        0       1       0              0          0

How to remove columns with repeated name but keeping data

Answers (1)

Related Questions