I would like to find runs of consecutive NaNs in my DataFrame columns and replace each NaN with the length of its run, so that something like
>>> df = pd.DataFrame([[np.nan, 2, np.nan],
... [3, 4, np.nan],
... [np.nan, np.nan, np.nan],
... [np.nan, 3, np.nan]],
... columns=list('ABC'))
>>> df
A B C
0 NaN 2.0 NaN
1 3.0 4.0 NaN
2 NaN NaN NaN
3 NaN 3.0 NaN
would give
>>> df
A B C
0 1.0 NaN 4.0
1 NaN NaN 4.0
2 2.0 1.0 4.0
3 2.0 NaN 4.0
Upvotes: 3
IIUC... groupby + mask + isnull
df.apply(lambda x: x.groupby(x.isnull().diff().ne(0).cumsum()).transform(len).mask(~x.isnull()))
Out[751]:
A B C
0 1.0 NaN 4.0
1 NaN NaN 4.0
2 2.0 1.0 4.0
3 2.0 NaN 4.0
For one column
df.A.groupby(df.A.isnull().diff().ne(0).cumsum()).transform(len).mask(~df.A.isnull())
Out[756]:
0 1.0
1 NaN
2 2.0
3 2.0
Name: A, dtype: float64
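The grouper trick is easier to see in isolation. Here is a self-contained run on column A, using `transform('size')` as an equivalent of `transform(len)`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

# Each run of NaN / non-NaN values gets its own integer label:
# diff() flags where the null mask changes, cumsum() numbers the runs.
grouper = df['A'].isnull().diff().ne(0).cumsum()
print(grouper.tolist())  # [1, 2, 3, 3]

# transform('size') broadcasts each run's length back onto its rows;
# mask() then blanks the rows that were not NaN to begin with.
out = df['A'].groupby(grouper).transform('size').mask(~df['A'].isnull())
print(out.tolist())  # [1.0, nan, 2.0, 2.0]
```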
Upvotes: 2
Use:
a = df.isnull()
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print (b)
A B C
0 1.0 NaN 4
1 NaN NaN 4
2 2.0 1.0 4
3 2.0 NaN 4
Detail:
#unique consecutive values
print (a.ne(a.shift()).cumsum())
A B C
0 1 1 1
1 2 1 1
2 3 2 1
3 3 3 1
#count values per columns and map
print (a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())))
A B C
0 1 2 4
1 1 2 4
2 2 1 4
3 2 1 4
#add NaNs by mask a
print (a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a))
A B C
0 1.0 NaN 4
1 NaN NaN 4
2 2.0 1.0 4
3 2.0 NaN 4
One column alternative:
a = df['A'].isnull()
b = a.ne(a.shift()).cumsum()
c = b.map(b.value_counts()).where(a)
print (c)
0 1.0
1 NaN
2 2.0
3 2.0
Name: A, dtype: float64
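For reference, the whole-frame version runs end-to-end on the question's example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

a = df.isnull()
# Label each run per column, then map every label to its frequency
# (the run length) and keep the result only on the NaN positions.
b = a.ne(a.shift()).cumsum().apply(lambda x: x.map(x.value_counts())).where(a)
print(b)
```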
Upvotes: 3
Not sure if it's the most elegant way, but here's how I did it:
def f(ds):
    ds = ds.isnull()
    splits = np.split(ds, np.where(ds == False)[0])
    counts = [np.sum(v) for v in splits]
    return pd.concat([pd.Series(split).replace({False: np.nan, True: count})
                      for split, count in zip(splits, counts)])

df.apply(lambda x: f(x))
Explanation:
# Binarize the array
ds = ds.isnull()
# Split the array before each False (the non-NaN positions), so each
# chunk contains at most one run of NaNs
splits = np.split(ds, np.where(ds == False)[0])
# Count the number of True values (NaNs) in each chunk
counts = [np.sum(v) for v in splits]
# Concatenate the chunks, replacing each True with its run length
# and each False with NaN
pd.concat([pd.Series(split).replace({False: np.nan, True: count})
           for split, count in zip(splits, counts)])
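Note that `np.split` on a pandas Series can be version-sensitive (internally it goes through `swapaxes`, which newer pandas deprecates on Series). A sketch of the same idea with explicit `iloc` slices instead, which keeps each chunk's original index (function name mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 3, np.nan]],
                  columns=list('ABC'))

def count_nan_runs(ds):
    isnan = ds.isnull()
    # positions of non-NaN values; cutting before each one leaves
    # chunks that contain at most one run of NaNs
    cuts = np.where(~isnan.to_numpy())[0]
    bounds = np.r_[0, cuts, len(isnan)]
    splits = [isnan.iloc[st:en] for st, en in zip(bounds[:-1], bounds[1:])]
    counts = [int(v.sum()) for v in splits]
    # True (a NaN in the original) becomes its run length, False becomes NaN
    return pd.concat([s.replace({False: np.nan, True: c})
                      for s, c in zip(splits, counts) if len(s)])

out = df.apply(count_nan_runs)
print(out)
```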
Upvotes: 2