Reputation: 636
I have a Pandas Series where some data is missing, represented by NaNs, and I would like to get an overview of how long the data is generally missing and how many times.
An example would be:
10:01 1.23
10:02 2.23
10:03 nan
10:04 nan
10:05 nan
10:06 6.23
10:07 nan
10:08 nan
10:09 9.23
Then the desired output would be a list like this:
missing = [[10:03,10:05], [10:07,10:08]]
N_missing = 2
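For reference, the example above can be reproduced as a Series; the string timestamps are an assumption here (the real index may well be datetimes):

```python
import numpy as np
import pandas as pd

# rebuild the example data from the question with string timestamps
s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=['10:01', '10:02', '10:03', '10:04', '10:05',
           '10:06', '10:07', '10:08', '10:09'],
)
print(s.isnull().sum())  # 5 individual NaN values across 2 gaps
```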
Upvotes: 2
Views: 133
Reputation: 74
If the columns of the dataframe are stored with these names (`date` and `value`):
missing = df[np.isnan(df['value'])]
no_missing = len(missing)
missing
date value
10:03 NaN
10:04 NaN
10:05 NaN
10:07 NaN
10:08 NaN
no_missing
5
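A runnable sketch of this approach, assuming the example data lives in a DataFrame with `date` and `value` columns; note it counts individual NaN rows, not gaps:

```python
import numpy as np
import pandas as pd

# reconstruct the question's data as a two-column DataFrame
df = pd.DataFrame({
    'date': ['10:01', '10:02', '10:03', '10:04', '10:05',
             '10:06', '10:07', '10:08', '10:09'],
    'value': [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
})

# keep only the rows where 'value' is NaN
missing = df[np.isnan(df['value'])]
no_missing = len(missing)
print(missing)
print(no_missing)  # 5 -- one per NaN row, not per gap
```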
Upvotes: 0
Reputation: 862591
Use:
#create DataFrame
df = df.reset_index()
df.columns = ['A','B']
#boolean mask of non-NaN values, stored in a variable for reuse
m = df['B'].notnull()
#cumulative sum of the mask as index - each run of consecutive NaNs gets a unique label
df.index = m.cumsum()
#filter only NaN rows, aggregate first and last value per group, convert to list
missing = df[~m.values].groupby(level=0)['A'].agg(['first','last']).values.tolist()
print (missing)
[['10:03', '10:05'], ['10:07', '10:08']]
#get length of nested lists
N_missing = len(missing)
print (N_missing)
2
Detail:
print (df[~m.values])
A B
B
2 10:03 NaN
2 10:04 NaN
2 10:05 NaN
3 10:07 NaN
3 10:08 NaN
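Run end to end on the question's data (reconstructed here with string timestamps, which is an assumption), the DataFrame variant looks like:

```python
import numpy as np
import pandas as pd

# rebuild the example Series, then turn it into a two-column DataFrame
s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=['10:01', '10:02', '10:03', '10:04', '10:05',
           '10:06', '10:07', '10:08', '10:09'],
)
df = s.reset_index()
df.columns = ['A', 'B']

# mask of non-NaN values; its cumulative sum labels each NaN run uniquely
m = df['B'].notnull()
df.index = m.cumsum()

# first and last timestamp of every NaN run
missing = df[~m.values].groupby(level=0)['A'].agg(['first', 'last']).values.tolist()
print(missing)  # [['10:03', '10:05'], ['10:07', '10:08']]
```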
Similar solution with Series:
m = s.notnull()
cum = m.cumsum()
missing = s[~m.values].index.to_series().groupby(cum).agg(['first','last']).values.tolist()
print (missing)
[['10:03', '10:05'], ['10:07', '10:08']]
N_missing = len(missing)
print (N_missing)
2
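Put together, the Series variant runs end to end as follows (the sample data is again reconstructed with string timestamps):

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.23, 2.23, np.nan, np.nan, np.nan, 6.23, np.nan, np.nan, 9.23],
    index=['10:01', '10:02', '10:03', '10:04', '10:05',
           '10:06', '10:07', '10:08', '10:09'],
)

# mask of non-NaN values; its cumulative sum labels each NaN run uniquely
m = s.notnull()
cum = m.cumsum()

# group the NaN timestamps by run label and take each run's first/last entry
missing = (s[~m.values].index.to_series()
            .groupby(cum).agg(['first', 'last']).values.tolist())
print(missing)    # [['10:03', '10:05'], ['10:07', '10:08']]

N_missing = len(missing)
print(N_missing)  # 2
```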
Upvotes: 3