Reputation: 3385
I have a large DataFrame
with many duplicate values. The unique values are stored in List1
. I'd like to do the following:
List1
should be added to eq_list
only if all the count values for the 'eq' are >=2. A simplified sample input:
List1 = ['A','B','C','D','E','F','G','H','X','Y','Z']
Sample DF 'ABC':
EQ1 EQ2 EQ3
0 A NaN NaN
1 X Y NaN
2 A X C
3 D E F
4 G H B
DESIRED OUTPUT:
eq_list = ['B','C','D','E','F','G','H','X','Y']
The codelet I tried:
for eq in List1:
MCS=ABC.loc[MCS_old[:] ==eq]
MCS = MCS.reset_index(drop=True)
for index_new in range(0,len(MCS)-1):
if int(MCS.iloc[[index_new]].count(axis=1))>2:
eq_list.append(raw_input(eq))
print(eq_list)
I hope that I have made the issue clear.
Upvotes: 3
Views: 553
Reputation: 42875
The below identifies the set
of (unique) values that occur in rows with more than 2 non-NaN
values, eliminates those that also occur in rows with less than 2 nonNaN
values. Avoids using loops.
First, get set
of unique values in the part of df
that does not meet the missing values restriction (and adding .strip()
to address a data issue mentioned in the comments):
na_threshold = 1
not_enough_non_nan = df[df.count(axis=1) <= 1].values.flatten().astype(str)
not_enough_non_nan = set([str(l).strip() for l in not_enough_non_nan if not l == 'nan'])
{'A'}
Next, identify the set
of values that do meet your restriction:
enough_non_nan = df[df.count(axis=1) > 1].values.flatten().astype(str)
enough_non_nan = set([str(l).strip() for l in enough_non_nan if not l == 'nan'])
{'H', 'C', 'E', 'B', 'D', 'X', 'F', 'A', 'Y', 'G'}
Finally, take the set
difference between the above to eliminate values do not always meet the restriction:
result = sorted(enough_non_nan - not_enough_non_nan)
['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
Upvotes: 1
Reputation: 25189
Suppose you have:
df
EQ1 EQ2 EQ3
0 A NaN NaN
1 X Y NaN
2 A X C
3 D E F
4 G H B
Then, you may proceed as follows:
dft = df.T
output_set = set()
prune_set = set()
for column in dft:
arr = dft[column].dropna().values
if len(arr) >=2:
output_set |= set(arr)
else:
prune_set |= set(arr)
sorted(output_set - prune_set)
['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']
Upvotes: 1