Reputation: 89
My CSV file looks something like this
location StartDate EndDate
Austin 10/24/20. 10/31/20
Austin 11/28/20. 12/05/20
Austin 12/26/20. 01/02/21
Austin 10/10/20 10/17/20
Austin 10/03/20. 10/10/20
Kansas 10/24/20. 10/31/20
Kansas 11/28/20. 12/05/20
Kansas 12/26/20. 01/02/21
Kansas 10/03/20. 10/10/20
Tampa 10/24/20. 10/31/20
Tampa 11/28/20. 12/05/20
Tampa 10/03/20. 10/10/20
As you can see Kansas is missing 10/10/20 - 10/17/20 and Tampa is missing 2 records for 10/10 and 12/26. Is there a way to find this missing records from the file using Pandas and python?
Upvotes: 2
Views: 480
Reputation: 5183
You can use unstack
and stack(dropna=False)
df = df.groupby(['StartDate', 'EndDate', 'location']).size().unstack()
df = df.stack(dropna=False).rename('count').reset_index()
missing = df[df['count'].isna()]
Output
print(missing)
StartDate EndDate location count
4 10/10/20 10/17/20 Kansas NaN
5 10/10/20 10/17/20 Tampa NaN
14 12/26/20. 01/02/21 Tampa NaN
Basically you are making a square matrix for all StartDate
and all location
. When you unstack
pandas places a NaN
if the combination of row/column labels is not in the dataframe. Then when you stack
pandas drops those NaN
by default, but you can pass dropna
parameter to keep them precisely for this use case.
Upvotes: 3
Reputation: 150785
Let's try pivot
and unstack
:
(df.pivot(*df)
.stack(dropna=False)
.loc[lambda x: x.isna()]
)
Output:
location StartDate
Kansas 2020-10-10 NaT
Tampa 2020-10-10 NaT
2020-12-26 NaT
dtype: datetime64[ns]
Upvotes: 4