Pandas dataframe: how to find missing years in a timeseries?

Question

I have a DataFrame with a timestamp index and some 100,000 rows. Via

df['year'] = df.index.year

it is easy to create a new column which contains the year of each row. Now I want to find out which years are missing from my timeseries. So far, I understand that I can use groupby to obtain "something" which allows me to find the unique values. Thus,

grouped = df.groupby('year')
grouped.groups.keys()

will give me the years which are present in my dataset. I could now build a complete year vector with

pd.date_range(df.index.min(), df.index.max(), freq='AS')

and through reindex I should then be able to find the missing years as those years which have NaN values.

However, this sounds awfully complicated for such a seemingly simple task, and the grouped.groups operation actually takes quite a while, presumably because it doesn't only look for unique keys but also builds the index lists of rows that belong to each key, which is a feature that I don't need here.

Is there any way to obtain the unique elements of a dataframe column more directly/efficiently?

EdChum · Accepted Answer

One method would be to construct a series of the years of interest and then use isin to see the missing values:

In [89]:

year_s = pd.Series(np.arange(1993, 2015))
year_s
Out[89]:
0     1993
1     1994
2     1995
3     1996
4     1997
5     1998
6     1999
7     2000
8     2001
9     2002
10    2003
11    2004
12    2005
13    2006
14    2007
15    2008
16    2009
17    2010
18    2011
19    2012
20    2013
21    2014
dtype: int32

In [88]:

df = pd.DataFrame({'year':[1999, 2000, 2013]})
df
Out[88]:
   year
0  1999
1  2000
2  2013

In [91]:

year_s[~year_s.isin(df['year'])]
Out[91]:
0     1993
1     1994
2     1995
3     1996
4     1997
5     1998
8     2001
9     2002
10    2003
11    2004
12    2005
13    2006
14    2007
15    2008
16    2009
17    2010
18    2011
19    2012
21    2014
dtype: int32

So in your case you can generate the year series as above, then for your df you can get the years using:

df.index.year.unique()

which will be much quicker than performing a groupby.

Take care that the last value passed to arange is not included in the range: np.arange(1993, 2015) stops at 2014.
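Putting the pieces together for a frame with a timestamp index, the approach might look like the minimal sketch below. The sample data and the names idx, present and missing are made up for illustration, and it assumes a reasonably recent pandas in which df.index.year is an Index (so that .unique() is available on it).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a DatetimeIndex covering only 1999, 2000 and 2013.
idx = pd.to_datetime(["1999-03-01", "1999-07-15", "2000-01-10", "2013-12-31"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Years present in the data, taken straight from the index (no groupby needed).
present = df.index.year.unique()

# Full span of years; the +1 is needed because arange excludes its stop value.
year_s = pd.Series(np.arange(present.min(), present.max() + 1))

# Keep only the years that never occur in the index.
missing = year_s[~year_s.isin(present)]
print(missing.tolist())
```

Here the span is derived from the first and last year in the data; if the years of interest are fixed in advance, a literal range such as np.arange(1993, 2015) can be used instead.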
