Reputation: 337
I am using pandas dataframe for a dataset where the attributes are English words. After stemming the words, I have multiple columns with the same name. Here are sample data snap, after stemming, the accept, acceptable and accepted
become accept
. I want to use bitwise_or on all columns with the same name and delete the repeated one. I tried this code
import numpy
from nltk.stem import *
import pandas as pd
ps = PorterStemmer()
dataset = pd.read_csv('sampleData.csv')
stemmed_words = []
for w in list(dataset):
stemmed_words.append(ps.stem(w))
dataset.columns = stemmed_words
new_word = stemmed_words[0]
for w in stemmed_words:
if new_word == w:
numpy.bitwise_or(dataset[new_word], dataset[w])
del dataset[w]
else:
new_word = w
print(dataset)
The problem is that when the for loop execute
del dataset['accept']
It deletes all the column with this name. and I do not know in advance that how many columns will have the same name and this code generate an exception KeyError: 'accept'
I want to apply bitwise_or on all three accept
columns, save it into a new column named 'accept' and del the old columns.
I hope that I will not be downvoted this time
Here is sample data:
able abundance academy accept accept accept access accommodation accompany Class
0 0 0 0 0 1 1 0 0 C
0 0 0 1 0 0 0 0 0 A
0 0 0 0 1 0 0 0 0 H
0 0 0 0 0 1 0 1 0 G
0 0 0 1 0 0 0 0 0 G
The output should be
Class able abundance academy accept access accommodation accompany
C 0 0 0 1 1 0 0
A 0 0 0 1 0 0 0
H 0 0 0 1 0 0 0
G 0 0 0 1 0 1 0
G 0 0 0 1 0 0 0
Upvotes: 1
Views: 105
Reputation: 210982
IIUC you can group by column names (axis=1
).
Data Frame:
In [101]: df
Out[101]:
able abundance academy accept accept accept access accommodation accompany Class
0 0 0 0 0 0 1 1 0 0 C
1 0 0 0 1 0 0 0 0 0 A
2 0 0 0 0 1 0 0 0 0 H
3 0 0 0 0 0 1 0 1 0 G
4 0 0 0 1 0 0 0 0 0 G
Solution:
In [103]: df.pop('Class').to_frame() \
...: .join(df.groupby(df.columns, axis=1).any(1).mul(1))
Out[103]:
Class able abundance academy accept access accommodation accompany
0 C 0 0 0 1 1 0 0
1 A 0 0 0 1 0 0 0
2 H 0 0 0 1 0 0 0
3 G 0 0 0 1 0 1 0
4 G 0 0 0 1 0 0 0
or even better solution (@ayhan, thank you for the hint!):
In [114]: df = df.pop('Class').to_frame().join(df.groupby(df.columns, axis=1).max())
In [115]: df
Out[115]:
Class able abundance academy accept access accommodation accompany
0 C 0 0 0 1 1 0 0
1 A 0 0 0 1 0 0 0
2 H 0 0 0 1 0 0 0
3 G 0 0 0 1 0 1 0
4 G 0 0 0 1 0 0 0
Upvotes: 0