SamuelNLP

Reputation: 4136

Find consecutive Nans in pandas dataframe

I would like to find runs of consecutive NaNs in my DataFrame columns and replace each NaN with the length of its run, with every non-NaN value becoming NaN, so that

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[np.nan, 2, np.nan],
...                    [3, 4, np.nan],
...                    [np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan]],
...                    columns=list('ABC'))
>>> df
     A    B   C 
0  NaN  2.0 NaN 
1  3.0  4.0 NaN 
2  NaN  NaN NaN 
3  NaN  3.0 NaN 

would give

>>> df
     A    B   C 
0  1.0  NaN 4.0 
1  NaN  NaN 4.0 
2  2.0  1.0 4.0 
3  2.0  NaN 4.0 

Upvotes: 3

Views: 1433

Answers (3)

BENY

Reputation: 323236

IIUC, you can combine groupby + transform + mask + isnull:

df.apply(lambda x: x.groupby(x.isnull().diff().ne(0).cumsum()).transform(len).mask(~x.isnull()))
Out[751]: 
     A    B    C
0  1.0  NaN  4.0
1  NaN  NaN  4.0
2  2.0  1.0  4.0
3  2.0  NaN  4.0

For one column:

df.A.groupby(df.A.isnull().diff().ne(0).cumsum()).transform(len).mask(~df.A.isnull())
Out[756]: 
0    1.0
1    NaN
2    2.0
3    2.0
Name: A, dtype: float64
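
A variant of the same idea, in case diff on a boolean Series misbehaves on your pandas version: build the run ids with ne/shift and size the groups with transform('size'). A minimal sketch, assuming the df from the question (the helper name count_consecutive_nans is my own):

import numpy as np
import pandas as pd

def count_consecutive_nans(s):
    # A new run id starts whenever the null flag flips between rows.
    is_na = s.isnull()
    run_id = is_na.ne(is_na.shift()).cumsum()
    # Size of the run each row belongs to, kept only on the NaN rows.
    return s.groupby(run_id).transform('size').where(is_na)

df.apply(count_consecutive_nans)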

Upvotes: 2

jezrael

Reputation: 862661

Use:

a = df.isnull()
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print (b)
     A    B  C
0  1.0  NaN  4
1  NaN  NaN  4
2  2.0  1.0  4
3  2.0  NaN  4

Detail:

#assign a unique id to each consecutive run
print (a.ne(a.shift()).cumsum())
   A  B  C
0  1  1  1
1  2  1  1
2  3  2  1
3  3  3  1

#count rows per run id and map the counts back to each row
print (a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())))
   A  B  C
0  1  2  4
1  1  2  4
2  2  1  4
3  2  1  4

#keep counts only at NaN positions by masking with a
print (a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a))
     A    B  C
0  1.0  NaN  4
1  NaN  NaN  4
2  2.0  1.0  4
3  2.0  NaN  4
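
As a usage aside, if you only need the longest NaN run per column, the run-length frame b from above just needs a column-wise max:

#longest consecutive-NaN run in each column
print (b.max())
A    2.0
B    1.0
C    4.0
dtype: float64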

One column alternative:

a = df['A'].isnull()
b = a.ne(a.shift()).cumsum()
c = b.map(b.value_counts()).where(a)

print (c)
0    1.0
1    NaN
2    2.0
3    2.0
Name: A, dtype: float64

Upvotes: 3

Stefan Falk

Reputation: 25397

Not sure if this is the most elegant way, but here is how I did it:

import numpy as np
import pandas as pd

def f(ds):
    ds = ds.isnull()
    splits = np.split(ds, np.where(ds == False)[0])
    counts = [np.sum(v) for v in splits]
    return pd.concat([pd.Series(split).replace({False: np.nan, True: count}) 
                      for split, count in zip(splits, counts)])

df.apply(lambda x: f(x))

Explanation:

# Turn the column into a boolean mask (True where NaN)
ds = ds.isnull()

# Split the mask at every False (the non-NaN positions)
splits = np.split(ds, np.where(ds == False)[0])

# Count the number of True values (the NaN run length) in each split
counts = [np.sum(v) for v in splits]

# Concatenate the pieces back into a Series with the requested values
pd.concat([pd.Series(split).replace({False: np.nan, True: count}) 
           for split, count in zip(splits, counts)])
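
Two small simplifications, if you prefer (same behaviour, assuming the df from the question): np.where(ds == False)[0] can be written np.where(~ds)[0], and since f already takes a Series, df.apply(f) replaces the lambda wrapper. A minimal sketch:

import numpy as np
import pandas as pd

def f(ds):
    ds = ds.isnull()
    # Split before each non-NaN position, so every piece contains
    # at most one run of consecutive NaNs (True values).
    splits = np.split(ds, np.where(~ds)[0])
    counts = [np.sum(v) for v in splits]
    return pd.concat([split.replace({False: np.nan, True: count})
                      for split, count in zip(splits, counts)])

df.apply(f)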

Upvotes: 2
