at8865

Reputation: 107

How to subset a dataframe by the number of NaNs in each row to the right of a selected column in Python?

I have an order book in the form of a dataframe that looks like this:

import pandas as pd
import numpy as np

months = list(range(1, 13))
li = list(map(str, months))
cols = ['ID']
cols.extend(li)

df = pd.DataFrame(np.random.randint(0,1000,size=(10, 13)), columns=cols)
df.loc[[1,2],'1':'12'] = np.nan
df.loc[3,'7':'12'] = np.nan
df.loc[5,'5':'12'] = np.nan
df.loc[7,'3':'8'] = np.nan
df.loc[9,'3':'10'] = np.nan

    ID      1      2      3      4      5      6      7      8      9     10     11     12
0  328   45.0  226.0  388.0  286.0  557.0  930.0  234.0  418.0  863.0  500.0  232.0  116.0
1  340    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2  865    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
3  313  947.0  272.0  936.0  501.0  241.0  731.0    NaN    NaN    NaN    NaN    NaN    NaN
4  293  772.0  185.0    6.0  284.0  522.0  826.0  995.0  370.0   87.0  668.0  469.0   40.0
5  226   31.0  994.0  896.0  889.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
6  622  128.0  337.0   67.0  763.0  999.0  853.0  172.0  927.0  460.0  602.0  134.0  115.0
7  454  407.0    1.0    NaN    NaN    NaN    NaN    NaN    NaN   33.0   60.0  112.0  127.0
8  538  968.0  924.0  113.0  162.0  416.0   16.0   88.0  631.0  516.0  593.0   65.0  574.0
9  501  949.0  709.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  272.0  220.0

As it stands, I loop through the columns and then the rows, populating each cell dependent on the value in the cell to its left.

Assuming that I am in month '5', meaning that I'm only interested in this section of the dataframe:

    ID       5      6      7      8      9     10     11     12
0  328     557.0  930.0  234.0  418.0  863.0  500.0  232.0  116.0
1  340       NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2  865       NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
3  313     241.0  731.0    NaN    NaN    NaN    NaN    NaN    NaN
4  293     522.0  826.0  995.0  370.0   87.0  668.0  469.0   40.0
5  226       NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
6  622     999.0  853.0  172.0  927.0  460.0  602.0  134.0  115.0
7  454       NaN    NaN    NaN    NaN   33.0   60.0  112.0  127.0
8  538     416.0   16.0   88.0  631.0  516.0  593.0   65.0  574.0
9  501       NaN    NaN    NaN    NaN    NaN    NaN  272.0  220.0

and given a variable term_len = 6, is there an efficient way to identify which IDs/indexes have 6 consecutive NaN cells starting at column '5'?

What I would expect is for the rows [1, 2, 5, 9] to be identified. Not index 7, because it only has 4 NaNs from column '5', nor index 3, because the row doesn't start with NaN.

The only way I can think to do it:

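month = 5
term_len = 6
# Keep only the columns from the current month onward
subset = df.loc[:, str(month):]
# Label of the first non-NaN column in each row (NaN if the whole row is NaN)
idxs = pd.to_numeric(subset.apply(pd.Series.first_valid_index, axis=1))
# Non-negative when at least term_len NaNs precede the first valid value
idxsT = idxs - month - term_len
idxsT.index[(idxsT >= 0) | (idxsT.isna())]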

Out: Int64Index([1, 2, 5, 9], dtype='int64')

Is there an alternative way of approaching this problem?

Upvotes: 2

Views: 175

Answers (1)

David Erickson

Reputation: 16683

  1. Select the six columns '5' through '10' by position with df.iloc[:, 5:11] (position 0 is ID, so position 5 is column '5').
  2. Call .isnull() on that selection, which returns True or False for every cell depending on whether or not it is null.
  3. Take .sum(axis=1) to count the True values in each row, since True counts as 1 and False as 0. The result, m, is a Series holding the number of NaNs per row within the selected columns.
  4. The last step is to filter the entire dataframe based on this count: df = df[m == 6] keeps the rows that have 6 True values, i.e. the rows that are all NaN across the selected columns.

code:

m = df.iloc[:, 5:11].isnull().sum(axis=1)
df = df[m == 6]

output:

    ID      1      2      3      4      5      6      7      8      9     10     11     12
1  340    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2  865    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
5  226   31.0  994.0  896.0  889.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
9  501  949.0  709.0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN  272.0  220.0
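
The same idea can be parameterized by the month and term_len from the question instead of hard-coding positions. The following is only a sketch under some assumptions: the helper name rows_with_leading_nans is hypothetical, the month columns are assumed to be string labels laid out contiguously as in the question, and the demo frame uses fixed values rather than the question's random data so the result is reproducible.

```python
import numpy as np
import pandas as pd


def rows_with_leading_nans(df, month, term_len):
    """Return index labels of rows that are NaN in term_len consecutive
    columns, starting at the column labelled str(month).

    Hypothetical helper for illustration only.
    """
    start = df.columns.get_loc(str(month))        # position of the start column
    window = df.iloc[:, start:start + term_len]   # term_len columns from there
    if window.shape[1] < term_len:
        # Not enough columns left to hold a full run of term_len NaNs
        return df.index[:0]
    return df.index[window.isna().all(axis=1)]


# Deterministic stand-in for the question's frame (same NaN pattern)
cols = ['ID'] + [str(m) for m in range(1, 13)]
demo = pd.DataFrame(np.arange(130, dtype=float).reshape(10, 13), columns=cols)
demo.loc[[1, 2], '1':'12'] = np.nan
demo.loc[3, '7':'12'] = np.nan
demo.loc[5, '5':'12'] = np.nan
demo.loc[7, '3':'8'] = np.nan
demo.loc[9, '3':'10'] = np.nan

result = rows_with_leading_nans(demo, month=5, term_len=6)
print(list(result))
```

Using .all(axis=1) on the window is equivalent to comparing the NaN count with the window width, and looking up the start column by label avoids working out positional offsets by hand. To keep the matching rows rather than just their labels, demo.loc[result] should do it.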

Upvotes: 2
