Reputation: 99
I'm working with a dataset with a large number of predictors, and want to easily test different composite variable groupings by using a control file. For starters, the control file would indicate whether or not to include a variable. Here's an example:
control = pd.DataFrame({'Variable': ['Var1','Var2','Var3'],
'Include': [1,0,1]})
control
Out[48]:
Include Variable
0 1 Var1
1 0 Var2
2 1 Var3
data = pd.DataFrame({'Sample':['a','b','c'],
'Var1': [1,0,0],
'Var2': [0,1,0],
'Var3': [0,0,1]})
data
Out[50]:
Sample Var1 Var2 Var3
0 a 1 0 0
1 b 0 1 0
2 c 0 0 1
So the result after processing should be a new dataframe which looks like data, but drops the Var2 column:
data2
Out[51]:
Sample Var1 Var3
0 a 1 0
1 b 0 0
2 c 0 1
I can get this to work by selectively dropping columns using .itterows():
data2 = data.copy()
for index, row in control.iterrows():
if row['Include'] != 1:
z = (row['Variable'])
data2.drop(z, axis=1,inplace="True")
This works, but it seems there should be a way to do this on the whole dataframe at once. Something like:
data2 = data[control['Include'] == 1]
However, this filters out rows based on the 'Include' value, not columns.
Any suggestions appreciated.
Upvotes: 1
Views: 270
Reputation: 294228
This should be a pretty fast solution using numpy and reconstruction
# get data columns values which is a numpy array
dcol = data.columns.values
# find the positions where control Include are non-zero
non0 = np.nonzero(control.Include.values)
# slice control.Variable to get names of Variables to include
icld = control.Variable.values[non0]
# search the column names of data for the included Variables
# and the Sample column to get the positions to slice
srch = dcol.searchsorted(np.append('Sample', icld))
# reconstruct the dataframe using the srch slice we've created
pd.DataFrame(data.values[:, srch], data.index, dcol[srch])
Upvotes: 0
Reputation: 57033
Select the necessary headers from the control
frame and use smart selection from the data
:
headers = control[control['Include']==1]['Variable']
all_headers = ['Sample'] + list(headers)
data[all_headers]
# Sample Var1 Var3
#0 a 1 0
#1 b 0 0
#2 c 0 1
A side note: Consider using boolean True
and False
instead of 0s and 1s in the Include
column, if possible.
Upvotes: 2