Reputation: 1375
I have a DataFrame with a timestamp index and some 100,000 rows. Via
df['year'] = df.index.year
it is easy to create a new column which contains the year of each row. Now I want to find out which years are missing from my timeseries. So far, I understand that I can use groupby to obtain "something" which allows me to find the unique values. Thus,
grouped = df.groupby('year')
grouped.groups.keys()
will give me the years which are present in my dataset. I could now build a complete year vector with
pd.date_range(df.index.min(), df.index.max(), freq='AS')
and through reindex I should then be able to find the missing years as those years which have NaN values.
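For concreteness, a minimal sketch of that reindex idea, using integer years instead of a date_range of year-start timestamps (df is assumed to have a DatetimeIndex, as above):
import pandas as pd
# Count rows per calendar year, then reindex against the full span of
# years; years absent from the data show up as NaN after the reindex.
counts_per_year = df.groupby(df.index.year).size()
all_years = range(df.index.year.min(), df.index.year.max() + 1)
reindexed = counts_per_year.reindex(all_years)
missing_years = reindexed[reindexed.isna()].index.tolist()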
However, this sounds awfully complicated for such a seemingly simple task, and the grouped.groups operation actually takes quite a while; presumably because it doesn't only look for unique keys but also builds the lists of row indices that belong to each key, a feature that I don't need here.
Is there any way to obtain the unique elements of a dataframe column more directly/efficiently?
Upvotes: 2
Views: 3025
Reputation: 15433
If all you want is a list of missing years, you can take the unique values of the year column and build the list of missing years with a list comprehension:
years = df['year'].unique()
missing_years = [y for y in range(min(years), max(years)+1) if y not in years]
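Note that membership tests against the NumPy array returned by unique() are linear scans, so for a wide year range it can be worth testing against a set instead, or using np.setdiff1d. A small variant of the same idea, assuming the same df['year'] column:
import numpy as np
# Same comprehension, but with a set for constant-time membership tests.
years = set(df['year'].unique())
missing_years = [y for y in range(min(years), max(years) + 1) if y not in years]
# Equivalent using NumPy's set difference.
missing_years = np.setdiff1d(np.arange(min(years), max(years) + 1),
                             list(years)).tolist()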
Upvotes: 2
Reputation: 394459
One method would be to construct a Series of the years of interest and then use isin to find the missing values:
In [89]:
year_s = pd.Series(np.arange(1993, 2015))
year_s
Out[89]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
6 1999
7 2000
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
20 2013
21 2014
dtype: int32
In [88]:
df = pd.DataFrame({'year':[1999, 2000, 2013]})
df
Out[88]:
year
0 1999
1 2000
2 2013
In [91]:
year_s[~year_s.isin(df['year'])]
Out[91]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
21 2014
dtype: int32
So in your case you can generate the year series as above, then for your df you can get the years using:
df.index.year.unique()
which will be much quicker than performing a groupby.
Take care that the last value passed to arange is not included in the range.
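Putting the pieces together for a timestamp-indexed frame like the one in the question (a sketch; the candidate year range is taken from the data itself here rather than hard-coded):
import numpy as np
import pandas as pd
# Years actually present in the DatetimeIndex.
present = pd.Series(df.index.year.unique())
# Candidate years to check; +1 because np.arange excludes the stop value.
candidates = pd.Series(np.arange(present.min(), present.max() + 1))
# Years in the candidate range that never occur in the index.
missing_years = candidates[~candidates.isin(present)]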
Upvotes: 3