Ronron
Ronron

Reputation: 89

How to find missing rows in csv using Pandas?

My CSV file looks something like this

location StartDate EndDate
Austin  10/24/20. 10/31/20
Austin  11/28/20. 12/05/20
Austin  12/26/20. 01/02/21
Austin  10/10/20  10/17/20
Austin  10/03/20. 10/10/20
Kansas  10/24/20. 10/31/20
Kansas  11/28/20. 12/05/20
Kansas  12/26/20. 01/02/21
Kansas  10/03/20. 10/10/20
Tampa   10/24/20. 10/31/20
Tampa   11/28/20. 12/05/20
Tampa   10/03/20. 10/10/20

As you can see Kansas is missing 10/10/20 - 10/17/20 and Tampa is missing 2 records for 10/10 and 12/26. Is there a way to find this missing records from the file using Pandas and python?

Upvotes: 2

Views: 480

Answers (2)

RichieV
RichieV

Reputation: 5183

You can use unstack and stack(dropna=False)

df = df.groupby(['StartDate', 'EndDate', 'location']).size().unstack()
df = df.stack(dropna=False).rename('count').reset_index()
missing = df[df['count'].isna()]

Output

print(missing)
    StartDate   EndDate location  count
4    10/10/20  10/17/20   Kansas    NaN
5    10/10/20  10/17/20    Tampa    NaN
14  12/26/20.  01/02/21    Tampa    NaN

Basically you are making a square matrix for all StartDate and all location. When you unstack pandas places a NaN if the combination of row/column labels is not in the dataframe. Then when you stack pandas drops those NaN by default, but you can pass dropna parameter to keep them precisely for this use case.

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150785

Let's try pivot and unstack:

(df.pivot(*df)
   .stack(dropna=False)
      .loc[lambda x: x.isna()]
)

Output:

location  StartDate 
Kansas    2020-10-10   NaT
Tampa     2020-10-10   NaT
          2020-12-26   NaT
dtype: datetime64[ns]

Upvotes: 4

Related Questions