Reputation: 37
I think I know why I am getting this error: the inner list does not match the outer list. I just can't work out how to deal with it.
The code is pretty simple. I have a data frame df with many columns. I want to drop every column that is more than 70% zeros, but the rule should apply only to the columns after column 22.
df = df.loc[:, (df.iloc[:, 22:]==0).mean() < 0.7]
Upvotes: 0
Views: 7257
Reputation: 23217
You get the error because the second argument you passed to df.loc is a boolean Series built from the slice df.iloc[:, 22:], so it covers fewer columns than df itself. When df.loc tries to use it for boolean indexing on the full df, it cannot align the shorter Series with the full column index.
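To make the mismatch concrete, here is a minimal sketch on a made-up four-column frame (the column names, the data, and the cut-off position 2 are assumptions standing in for the real df and position 22):

```python
import pandas as pd

# Hypothetical 4-column frame; the zero rule applies from position 2 onward.
df = pd.DataFrame({
    "a": [0, 0, 0, 0],
    "b": [1, 2, 3, 4],
    "c": [0, 0, 0, 1],   # 75% zeros
    "d": [0, 1, 1, 1],   # 25% zeros
})

# The mask is indexed by ['c', 'd'] only, while df has four columns.
mask = (df.iloc[:, 2:] == 0).mean() < 0.7

try:
    df.loc[:, mask]
    error_name = None
except Exception as exc:  # pandas raises an IndexingError here
    error_name = type(exc).__name__

print(list(mask.index), list(df.columns), error_name)
```

The mask has entries for 'c' and 'd' only, so there is nothing to say whether 'a' and 'b' should be kept, and the lookup fails.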
You can fix this by simply using:
df.iloc[:, 22:].loc[:, (df == 0).mean() < 0.7]
This works because a slice of df can be indexed with a boolean Series that covers more columns than the slice has (pandas aligns the Series to the slice's columns), but not the other way round.
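A small sketch of that alignment, again on a made-up frame (names, data, and the cut-off position 2 are assumptions, and the mask uses the question's `== 0` test):

```python
import pandas as pd

# Hypothetical frame standing in for the real df.
df = pd.DataFrame({
    "a": [0, 0, 0, 0],
    "b": [1, 2, 3, 4],
    "c": [0, 0, 0, 1],   # 75% zeros
    "d": [0, 1, 1, 1],   # 25% zeros
})

full_mask = (df == 0).mean() < 0.7   # one entry per column of df
tail = df.iloc[:, 2:]                # only columns 'c' and 'd'

# .loc uses the entries of full_mask whose labels match tail's columns.
kept = tail.loc[:, full_mask]
print(list(kept.columns))
```

Here the entries for 'a' and 'b' in full_mask are simply ignored, and only 'd' survives from the slice.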
If you only want to keep the filtered columns from position 22 onward, you can assign the result back to the original dataframe name:
df = df.iloc[:, 22:].loc[:, (df == 0).mean() < 0.7]
However, if you want the final dataframe to also contain columns 0:22, you can .join() those leading columns with the filtered ones:
df1 = df.iloc[:, 22:].loc[:, (df == 0).mean() < 0.7]
df = df.iloc[:, :22].join(df1)
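Put together on a toy frame, the filter-then-join steps might look like the sketch below. START, the column names, and the data are hypothetical stand-ins for position 22 and the real df:

```python
import pandas as pd

START = 2  # hypothetical stand-in for position 22 in the question

df = pd.DataFrame({
    "a": [0, 0, 0, 0],   # all zeros, but before START, so always kept
    "b": [1, 2, 3, 4],
    "c": [0, 0, 0, 1],   # 75% zeros, dropped
    "d": [0, 1, 1, 1],   # 25% zeros, kept
})

# Filter only the trailing columns, then put the leading ones back in front.
filtered_tail = df.iloc[:, START:].loc[:, (df == 0).mean() < 0.7]
result = df.iloc[:, :START].join(filtered_tail)
print(list(result.columns))  # ['a', 'b', 'd']
```

Column 'a' is all zeros but is left alone because it sits before START, which is the behaviour the question asks for.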
Upvotes: 1
Reputation: 120391
# Two-step equivalent of the .pipe() one-liner below:
# m = (df.iloc[:, 22:] != 0).mean() < 0.7
# m = m.loc[m].index
m = ((df.iloc[:, 22:] != 0).mean() < 0.7).pipe(lambda m: m.loc[m].index)
df = df.drop(columns=m)
>>> m
Index(['c22', 'c23', 'c28', 'c30', 'c32', 'c35', 'c38', 'c39'], dtype='object')
>>> df.columns
Index(['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10',
       'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c20',
       'c21', 'c24', 'c25', 'c26', 'c27', 'c29', 'c31', 'c33', 'c34', 'c36',
       'c37'],
      dtype='object')
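For a version that can be checked by hand, here is a hedged sketch of the same collect-labels-then-drop idea on a tiny frame. It assumes the rule exactly as worded in the question (drop a column when more than 70% of its values are zero), and START, the names, and the data are made up:

```python
import pandas as pd

START = 2  # hypothetical stand-in for position 22

df = pd.DataFrame({
    "a": [0, 0, 0, 0],   # before START, never considered
    "b": [1, 2, 3, 4],
    "c": [0, 0, 0, 1],   # 75% zeros
    "d": [0, 1, 1, 1],   # 25% zeros
})

# Fraction of zeros per column, computed only for the trailing columns.
zero_share = (df.iloc[:, START:] == 0).mean()

# Labels of the columns that exceed the assumed 70% threshold.
to_drop = zero_share[zero_share > 0.7].index

result = df.drop(columns=to_drop)
print(list(to_drop), list(result.columns))
```

Dropping by label avoids the alignment problem altogether, since drop() only needs the names of the columns to remove.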
Upvotes: 1