Reputation: 297
I have the sample dataframe below. Each hour should have 5 instances. Is there a module or some other way to find missing data in a given column in Python? For instance, for hour 2, instance 3 is missing. How can we identify such a missing instance dynamically in a larger dataset and print a message saying that an instance is missing?
Date Hour Instance
2022-10-20 1 1
2022-10-20 1 2
2022-10-20 1 3
2022-10-20 1 4
2022-10-20 1 5
2022-10-20 2 1
2022-10-20 2 2
2022-10-20 2 4
2022-10-20 2 5
Thank you.
Upvotes: 1
Views: 126
Reputation: 262114
Using a crosstab:
df2 = pd.crosstab(df['Hour'], df['Instance'])
out = df2[df2.eq(0)].stack().reset_index()[['Hour', 'Instance']]
Output:
Hour Instance
0 2 3
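For reference, here is a self-contained sketch of the crosstab idea that also prints the message the question asks for. The sample frame is rebuilt from the question's data; the names `counts` and `missing` are only illustrative. It filters for zero counts after stacking rather than before, since whether `stack()` drops NaN cells appears to depend on the pandas version.

```python
import pandas as pd

# Sample data from the question: hour 2 lacks instance 3.
df = pd.DataFrame({
    "Date": ["2022-10-20"] * 9,
    "Hour": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Instance": [1, 2, 3, 4, 5, 1, 2, 4, 5],
})

# Count occurrences of every (Hour, Instance) pair; absent pairs get 0.
counts = pd.crosstab(df["Hour"], df["Instance"])

# Turn the table into one count per (Hour, Instance) pair, then keep the zeros.
stacked = counts.stack()
missing = stacked[stacked.eq(0)].index.to_frame(index=False)

for row in missing.itertuples(index=False):
    print(f"Hour {row.Hour}: instance {row.Instance} is missing")
```

Note that an instance number only shows up as a crosstab column if it occurs in at least one hour, so an instance that is absent everywhere would go unreported.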
Upvotes: 2
Reputation: 18315
(df.set_index(["Hour", "Instance"])
   .unstack()
   .isna().where(lambda fr: fr)
   .stack()
   .reset_index()
   [["Hour", "Instance"]])
to get
Hour Instance
0 2 3
This means only one instance is missing: Instance 3 of Hour 2.
Upvotes: 1
Reputation: 1479
First, define a function that checks if there is a missing instance.
def is_data_missing(array):
    """Return True when data is missing, i.e. array is different from range."""
    return list(array) != list(range(1, len(array) + 1))
Then you can apply it to your DataFrame, grouping by hour first.
>>> df_missing = df.groupby("Hour")["Instance"].apply(lambda x: is_data_missing(x.values))
>>> df_missing
Hour
1 False
2 True
The resulting Series tells you which hours have missing data. You can then iterate over its items to print a message.
>>> for hour, is_missing in df_missing.items():
...     if is_missing:
...         print(f"At hour {hour} data is missing")
At hour 2 data is missing
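One caveat with the range comparison: if the last instance of an hour is missing (say the hour only has 1 to 4), the values still form a complete range and nothing is flagged. A possible variant, sketched below, compares each group against an explicit expected set and reports which instances are absent. It assumes every hour should contain instances 1 to 5, as stated in the question; `EXPECTED` and `missing_instances` are hypothetical names, and grouping by `Date` as well is an addition for multi-day data.

```python
import pandas as pd

# Assumption taken from the question: each hour should hold instances 1..5.
EXPECTED = set(range(1, 6))

# Sample data from the question: hour 2 lacks instance 3.
df = pd.DataFrame({
    "Date": ["2022-10-20"] * 9,
    "Hour": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Instance": [1, 2, 3, 4, 5, 1, 2, 4, 5],
})


def missing_instances(instances):
    """Return the sorted expected instances that do not appear in `instances`."""
    return sorted(EXPECTED - set(instances))


# One list of missing instances per (Date, Hour) group.
missing_by_hour = df.groupby(["Date", "Hour"])["Instance"].apply(missing_instances)

for (date, hour), missing in missing_by_hour.items():
    for instance in missing:
        print(f"{date} hour {hour}: instance {instance} is missing")
```

This still cannot report an hour that is absent from the data entirely, since such an hour never forms a group.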
Upvotes: 0