user1355179
user1355179

Reputation: 99

How to select columns in dataframe based on lookup in another?

I'm working with a dataset with a large number of predictors, and want to easily test different composite variable groupings by using a control file. For starters, the control file would indicate whether or not to include a variable. Here's an example:

control = pd.DataFrame({'Variable': ['Var1','Var2','Var3'], 
                    'Include': [1,0,1]})

control
Out[48]: 
   Include Variable
0        1     Var1
1        0     Var2
2        1     Var3

data = pd.DataFrame({'Sample':['a','b','c'],
                    'Var1': [1,0,0],
                    'Var2': [0,1,0],
                    'Var3': [0,0,1]})

data
Out[50]: 
  Sample  Var1  Var2  Var3
0      a     1     0     0
1      b     0     1     0
2      c     0     0     1

So the result after processing should be a new dataframe which looks like data, but drops the Var2 column:

data2
Out[51]: 
  Sample  Var1  Var3
0      a     1     0
1      b     0     0
2      c     0     1

I can get this to work by selectively dropping columns using .itterows():

data2 = data.copy()
for index, row in control.iterrows():
    if row['Include'] != 1:
       z = (row['Variable'])
       data2.drop(z, axis=1,inplace="True")

This works, but it seems there should be a way to do this on the whole dataframe at once. Something like:

data2 = data[control['Include'] == 1]

However, this filters out rows based on the 'Include' value, not columns.

Any suggestions appreciated.

Upvotes: 1

Views: 270

Answers (2)

piRSquared
piRSquared

Reputation: 294228

This should be a pretty fast solution using numpy and reconstruction

# get data columns values which is a numpy array
dcol = data.columns.values
# find the positions where control Include are non-zero
non0 = np.nonzero(control.Include.values)
# slice control.Variable to get names of Variables to include
icld = control.Variable.values[non0]
# search the column names of data for the included Variables
# and the Sample column to get the positions to slice
srch = dcol.searchsorted(np.append('Sample', icld))
# reconstruct the dataframe using the srch slice we've created
pd.DataFrame(data.values[:, srch], data.index, dcol[srch])

enter image description here

Upvotes: 0

DYZ
DYZ

Reputation: 57033

Select the necessary headers from the control frame and use smart selection from the data:

headers = control[control['Include']==1]['Variable']
all_headers = ['Sample'] + list(headers)
data[all_headers]
#  Sample  Var1  Var3
#0      a     1     0
#1      b     0     0
#2      c     0     1

A side note: Consider using boolean True and False instead of 0s and 1s in the Include column, if possible.

Upvotes: 2

Related Questions