controlfreak
controlfreak

Reputation: 3385

How to count the number of elements in a set of rows selected based on a condition?

I have a large DataFrame with many duplicate values. The unique values are stored in List1. I'd like to do the following:

  1. Select a few rows that contain each of the values present in the list.
  2. Iterate over the selected rows and count the number of non NaN elements
  3. If the count value is greater than or equal to 2, store it in a new list. Each component in List1 should be added to eq_list only if all the count values for the 'eq' are >=2.

A simplified sample input:

List1 = ['A','B','C','D','E','F','G','H','X','Y','Z']

Sample DF 'ABC':

        EQ1  EQ2   EQ3
0       A    NaN   NaN
1       X    Y     NaN
2       A    X     C
3       D    E     F
4       G    H     B

DESIRED OUTPUT:

eq_list = ['B','C','D','E','F','G','H','X','Y']

The codelet I tried:

for eq in List1:
    MCS=ABC.loc[MCS_old[:] ==eq]
    MCS = MCS.reset_index(drop=True)
    for index_new in range(0,len(MCS)-1):
        if int(MCS.iloc[[index_new]].count(axis=1))>2:
            eq_list.append(raw_input(eq))
            print(eq_list)

I hope that I have made the issue clear.

Upvotes: 3

Views: 553

Answers (2)

Stefan
Stefan

Reputation: 42875

The below identifies the set of (unique) values that occur in rows with more than 2 non-NaN values, eliminates those that also occur in rows with less than 2 nonNaN values. Avoids using loops.

First, get set of unique values in the part of df that does not meet the missing values restriction (and adding .strip() to address a data issue mentioned in the comments):

na_threshold = 1
not_enough_non_nan = df[df.count(axis=1) <= 1].values.flatten().astype(str)
not_enough_non_nan = set([str(l).strip() for l in not_enough_non_nan if not l == 'nan'])

{'A'}

Next, identify the set of values that do meet your restriction:

enough_non_nan = df[df.count(axis=1) > 1].values.flatten().astype(str)
enough_non_nan = set([str(l).strip() for l in enough_non_nan if not l == 'nan'])

{'H', 'C', 'E', 'B', 'D', 'X', 'F', 'A', 'Y', 'G'}

Finally, take the set difference between the above to eliminate values do not always meet the restriction:

result = sorted(enough_non_nan - not_enough_non_nan)

['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']

Upvotes: 1

Sergey Bushmanov
Sergey Bushmanov

Reputation: 25189

Suppose you have:

df
    EQ1 EQ2 EQ3
0   A   NaN NaN
1   X   Y   NaN
2   A   X   C
3   D   E   F
4   G   H   B

Then, you may proceed as follows:

dft = df.T
output_set = set()
prune_set =  set()
for column in dft:
    arr = dft[column].dropna().values
    if len(arr) >=2:
        output_set |= set(arr)
    else:
        prune_set |= set(arr)
sorted(output_set - prune_set)
['B', 'C', 'D', 'E', 'F', 'G', 'H', 'X', 'Y']

Upvotes: 1

Related Questions