motiur
motiur

Reputation: 1680

Select column index in pandas dataframe based on values

I have a csv file like so:

Id               col1    col2    col3     
a                 7.04   0.3     1.2    
b                 0.3    1.7     .1   
c                 0.34   0.05    1.3   
d                 0.4    1.60    3.1

I want to convert it to a data frame thresholding on 0.5 . If the values is greater than or equal to 0.5 then the column is counted , otherwise it is not counted.

Id               classes    
a                 col1,col3    
b                 col2        
c                 col3   
d                 col2,col3     

The closest solution I found is this one. However it deals with single rows, not multiple rows. For multiple rows, the best I have is iterate through all the rows. I need a succinct expression without for loop.

Upvotes: 1

Views: 360

Answers (2)

piRSquared
piRSquared

Reputation: 294198

Comprehension after clever multiplication... This assumes that Id is the index.

df.assign(classes=[
    ','.join(s for s in row if s)
    for row in df.ge(.5).mul(df.columns).values
])

    col1  col2  col3    classes
Id                             
a   7.04  0.30   1.2  col1,col3
b   0.30  1.70   0.1       col2
c   0.34  0.05   1.3       col3
d   0.40  1.60   3.1  col2,col3

Setup Fun Trick
Custom subclass of str that redefines string addition to include a ','

class s(str):
    def __add__(self, other):
        if self and other:
            return s(super().__add__(',' + other))
        else:
            return s(super().__add__(other))

Fun Trick

df.ge(.5).mul(df.columns).applymap(s).sum(1)

Id
a    col1,col3
b         col2
c         col3
d    col2,col3
dtype: object

Upvotes: 2

jezrael
jezrael

Reputation: 862406

Use set_index first and then numpy.where for extract columns by condition. Last remove empty strings by list comprehension:

df = df.set_index('Id')
s = np.where(df > .5, ['{}, '.format(x) for x in df.columns], '')
df['new'] = pd.Series([''.join(x).strip(', ') for x in s], index=df.index)
print (df)
    col1  col2  col3         new
Id                              
a   7.04  0.30   1.2  col1, col3
b   0.30  1.70   0.1        col2
c   0.34  0.05   1.3        col3
d   0.40  1.60   3.1  col2, col3

Similar for new DataFrame:

df1 = pd.DataFrame({'classes': [''.join(x).strip(', ') for x in s], 
                    'Id': df.index})
print (df1)
  Id     classes
0  a  col1, col3
1  b        col2
2  c        col3
3  d  col2, col3

And if necessary remove empty with ,:

df1 = pd.DataFrame({'classes': [''.join(x).strip(', ').replace(', ',',') for x in s], 
                    'Id': df.index})
print (df1)

  Id    classes
0  a  col1,col3
1  b       col2
2  c       col3
3  d  col2,col3

Detail:

print (s)
[['col1, ' '' 'col3, ']
 ['' 'col2, ' '']
 ['' '' 'col3, ']
 ['' 'col2, ' 'col3, ']]

Alternative with apply (slowier):

df1 = (df.set_index('Id')
         .apply(lambda x: ','.join(x.index[x > .5]), 1)
         .reset_index(name='classes'))
print (df1)
  Id    classes
0  a  col1,col3
1  b       col2
2  c       col3
3  d  col2,col3

Upvotes: 2

Related Questions