Reputation: 4060
I have a dataframe called df
The columns in the dataframe can be logically grouped. Hence I grouped the column names in lists A, B, C where:
A = [column_1, column_2, column_3]
B = [column_4, column_5, column_6]
C = [column_7, column_8, column_9]
In addition to the columns column_1 to column_9, df has one more column called "filename_ID", which is used as the index and thus is not grouped. The columns column_1 to column_9 contain only 0 and 1 values.
Now I want to filter the dataframe so that it only includes rows with at least one non-zero value in each group (A, B, C). That is, I only want to keep the rows (with their respective filename_ID) that fulfill this condition.
I have managed to create a separate dataframe for each group:
df_A = df.loc[(df[A]!=0).any(axis=1)]
df_B = df.loc[(df[B]!=0).any(axis=1)]
df_C = df.loc[(df[C]!=0).any(axis=1)]
However, I don't know how to apply all three conditions simultaneously, i.e. how to create one new dataframe in which every row has at least one non-zero value in each logical column group.
Upvotes: 2
Views: 926
Reputation: 294258
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.randint(2, size=(10, 9)),
columns=[f"col{i + 1}" for i in range(9)]
)
df
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0 1 0 1 0 0 1 0 1
1 1 1 1 0 1 1 0 1 0
2 0 0 0 0 0 0 0 0 0
3 1 0 1 1 1 1 0 0 0
4 0 0 1 1 1 1 1 0 1
5 1 1 0 1 1 1 1 1 1
6 1 0 1 0 0 0 1 1 0
7 0 0 0 0 0 1 0 1 0
8 1 0 1 0 1 0 0 1 1
9 1 0 1 0 0 1 0 1 0
Create a dictionary mapping each column to its group
m = {
**dict.fromkeys(['col1', 'col2', 'col3'], 'A'),
**dict.fromkeys(['col4', 'col5', 'col6'], 'B'),
**dict.fromkeys(['col7', 'col8', 'col9'], 'C'),
}
Then groupby based on axis=1
df[df.groupby(m, axis=1).any().all(1)]
col1 col2 col3 col4 col5 col6 col7 col8 col9
0 0 1 0 1 0 0 1 0 1
1 1 1 1 0 1 1 0 1 0
4 0 0 1 1 1 1 1 0 1
5 1 1 0 1 1 1 1 1 1
8 1 0 1 0 1 0 0 1 1
9 1 0 1 0 0 1 0 1 0
Notice the ones that didn't make it
col1 col2 col3 col4 col5 col6 col7 col8 col9
2 0 0 0 0 0 0 0 0 0
3 1 0 1 1 1 1 0 0 0
6 1 0 1 0 0 0 1 1 0
7 0 0 0 0 0 1 0 1 0
You could also have had columns like this:
cols = [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8', 'col9']]
m = {k: v for v, c in enumerate(cols) for k in c}
And performed the same groupby
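On newer pandas versions, where `DataFrame.groupby(axis=1)` is deprecated, the same idea can be expressed by transposing first. A sketch on a hypothetical three-row miniature of the frame above:

```python
import pandas as pd

# hypothetical 3-row miniature of the frame above
df = pd.DataFrame(
    [[0, 1, 0, 1, 0, 0, 1, 0, 1],
     [0, 0, 0, 0, 0, 0, 0, 0, 0],
     [1, 0, 1, 1, 1, 1, 0, 0, 0]],
    columns=[f"col{i + 1}" for i in range(9)],
)

cols = [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8', 'col9']]
m = {k: v for v, c in enumerate(cols) for k in c}

# group the transposed frame's rows (the original columns) by m;
# any() flags groups containing a non-zero, all() requires every group to qualify
kept = df[df.T.groupby(m).any().all()]
```

Here only the first row survives: the second is all zeros, and the third has no non-zero in the col7-col9 group.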
Upvotes: 3
Reputation: 338
I created a CSV file with the sample data below.
Sample Input:
ID a1 a2 a3 a4 a5 a6 a7 a8 a9
1 1 1 1 1 1 1 1 1 1
2 0 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0 0
4 0 0 0 0 1 0 1 0 1
5 1 1 0 1 1 1 1 0 1
6 0 0 0 0 1 0 0 1 0
7 1 0 1 1 1 0 1 1 1
8 1 1 1 0 1 1 1 0 1
9 0 0 0 1 0 1 0 0 0
10 0 0 1 0 0 0 0 0 0
11 1 0 1 0 1 1 0 1 1
12 1 1 0 1 0 1 1 0 1
import pandas as pd
df = pd.read_csv('check.csv')
df['sumA'] = df.a1+df.a2+df.a3
df['sumB'] = df.a4+df.a5+df.a6
df['sumC'] = df.a7+df.a8+df.a9
new_df = df[(df.sumA > 0) & (df.sumB > 0) & (df.sumC > 0)]  # at least one non-zero per group
new_df = new_df.drop(['sumA','sumB','sumC'],axis=1)
Output:
ID a1 a2 a3 a4 a5 a6 a7 a8 a9
0 1 1 1 1 1 1 1 1 1 1
4 5 1 1 0 1 1 1 1 0 1
6 7 1 0 1 1 1 0 1 1 1
7 8 1 1 1 0 1 1 1 0 1
10 11 1 0 1 0 1 1 0 1 1
11 12 1 1 0 1 0 1 1 0 1
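The helper columns aren't strictly necessary; an equivalent sketch (on a hypothetical miniature of the sample data) builds one boolean mask per group and combines them:

```python
import pandas as pd

# hypothetical miniature of the sample data above
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "a1": [1, 0, 0], "a2": [1, 0, 1], "a3": [1, 0, 0],
    "a4": [1, 1, 0], "a5": [1, 0, 0], "a6": [1, 0, 0],
    "a7": [1, 0, 1], "a8": [1, 1, 0], "a9": [1, 0, 0],
})

groups = [["a1", "a2", "a3"], ["a4", "a5", "a6"], ["a7", "a8", "a9"]]

# each row-wise group sum must be > 0, i.e. at least one non-zero per group
keep = pd.concat([df[g].sum(axis=1) > 0 for g in groups], axis=1).all(axis=1)
new_df = df[keep]
```

This avoids mutating df and dropping the temporary columns afterwards.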
Upvotes: 1
Reputation: 5460
Try the following:
import numpy as np

column_groups = [A, B, C]
masks = [(df[cols] != 0).any(axis=1) for cols in column_groups]
full_mask = np.logical_and.reduce(masks)
full_df = df[full_mask]
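For instance, on a toy frame indexed by filename_ID (data and index values hypothetical, mirroring the question's layout):

```python
import numpy as np
import pandas as pd

# hypothetical frame indexed by filename_ID, as in the question
df = pd.DataFrame(
    {"column_1": [1, 0], "column_2": [0, 0], "column_3": [0, 0],
     "column_4": [1, 0], "column_5": [0, 1], "column_6": [0, 0],
     "column_7": [0, 0], "column_8": [1, 0], "column_9": [0, 0]},
    index=pd.Index(["file_a", "file_b"], name="filename_ID"),
)

A = ["column_1", "column_2", "column_3"]
B = ["column_4", "column_5", "column_6"]
C = ["column_7", "column_8", "column_9"]

# one boolean mask per group, AND-ed together across groups
masks = [(df[cols] != 0).any(axis=1) for cols in [A, B, C]]
full_df = df[np.logical_and.reduce(masks)]
```

Only file_a survives: file_b has no non-zero value in group A.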
Upvotes: 1