Reputation: 1680
I have a csv file like so:
Id col1 col2 col3
a 7.04 0.3 1.2
b 0.3 1.7 .1
c 0.34 0.05 1.3
d 0.4 1.60 3.1
I want to convert it to a data frame thresholding on 0.5 . If the values is greater than or equal to 0.5 then the column is counted , otherwise it is not counted.
Id classes
a col1,col3
b col2
c col3
d col2,col3
The closest solution I found is this one. However it deals with single rows, not multiple rows. For multiple rows, the best I have is iterate through all the rows. I need a succinct expression without for loop.
Upvotes: 1
Views: 360
Reputation: 294198
Comprehension after clever multiplication... This assumes that Id
is the index.
df.assign(classes=[
','.join(s for s in row if s)
for row in df.ge(.5).mul(df.columns).values
])
col1 col2 col3 classes
Id
a 7.04 0.30 1.2 col1,col3
b 0.30 1.70 0.1 col2
c 0.34 0.05 1.3 col3
d 0.40 1.60 3.1 col2,col3
Setup Fun Trick
Custom subclass of str
that redefines string addition to include a ','
class s(str):
def __add__(self, other):
if self and other:
return s(super().__add__(',' + other))
else:
return s(super().__add__(other))
Fun Trick
df.ge(.5).mul(df.columns).applymap(s).sum(1)
Id
a col1,col3
b col2
c col3
d col2,col3
dtype: object
Upvotes: 2
Reputation: 862406
Use set_index
first and then numpy.where
for extract columns by condition. Last remove empty strings by list comprehension
:
df = df.set_index('Id')
s = np.where(df > .5, ['{}, '.format(x) for x in df.columns], '')
df['new'] = pd.Series([''.join(x).strip(', ') for x in s], index=df.index)
print (df)
col1 col2 col3 new
Id
a 7.04 0.30 1.2 col1, col3
b 0.30 1.70 0.1 col2
c 0.34 0.05 1.3 col3
d 0.40 1.60 3.1 col2, col3
Similar for new DataFrame
:
df1 = pd.DataFrame({'classes': [''.join(x).strip(', ') for x in s],
'Id': df.index})
print (df1)
Id classes
0 a col1, col3
1 b col2
2 c col3
3 d col2, col3
And if necessary remove empty with ,
:
df1 = pd.DataFrame({'classes': [''.join(x).strip(', ').replace(', ',',') for x in s],
'Id': df.index})
print (df1)
Id classes
0 a col1,col3
1 b col2
2 c col3
3 d col2,col3
Detail:
print (s)
[['col1, ' '' 'col3, ']
['' 'col2, ' '']
['' '' 'col3, ']
['' 'col2, ' 'col3, ']]
Alternative with apply
(slowier):
df1 = (df.set_index('Id')
.apply(lambda x: ','.join(x.index[x > .5]), 1)
.reset_index(name='classes'))
print (df1)
Id classes
0 a col1,col3
1 b col2
2 c col3
3 d col2,col3
Upvotes: 2