Reputation: 911
I want to remove incomplete years from a dataframe to analyse the complete years further. I've looked here but the question is old and unresolved. This post follows on from a previous one where some of this code came from.
Dataframe:
dates = pd.date_range('20070101',periods=3200)
df = pd.DataFrame(np.random.randn(3200), columns =list ('A'))
df['date'] = dates
df = df[['date','A']]
df.head()
date A
0 2007-01-01 -0.961392
1 2007-01-02 0.832136
2 2007-01-03 1.007776
3 2007-01-04 -1.144121
4 2007-01-05 -0.850436
Index by year, count the days
count = df.set_index([df['date'].dt.year, df['date'].dt.day]).count(level=0)
count
date A
date
2007 365 365
2008 366 366
2009 365 365
2010 365 365
2011 365 365
2012 366 366
2013 365 365
2014 365 365
2015 278 278
Drop all the years less than 365 days:
count = count.drop(count[count.date < 365].index)
I now want to 'uncount' the data so it returns to looking like this:
date A
0 2007-01-01 -0.961392
1 2007-01-02 0.832136
2 2007-01-03 1.007776
3 2007-01-04 -1.144121
4 2007-01-05 -0.850436
but doesn't include 2015 as the year is too short.
Create a years variable
years = count[count['date'] >= 365].index
reformat the dataframe
df = df[df['date'].dt.year.isin(years)]
df
Out[11]:
date A
0 2007-01-01 -0.095152
1 2007-01-02 -0.560478
2 2007-01-03 0.793423
... .... .... ...
2919 2014-12-29 -1.616931
2920 2014-12-30 -3.499278
2921 2014-12-31 0.913521
Upvotes: 1
Views: 1735
Reputation: 394003
Get the years
and pass to your original df and call isin
and pass dt.year
:
In [93]:
years = count[count['date'] >= 365].index
df[df['date'].dt.year.isin(years)]
Out[93]:
date A
0 2007-01-01 0.020745
1 2007-01-02 0.030024
2 2007-01-03 -0.385793
3 2007-01-04 -0.737720
4 2007-01-05 0.089707
5 2007-01-06 -0.820141
6 2007-01-07 -0.081740
7 2007-01-08 0.233265
8 2007-01-09 1.336224
9 2007-01-10 0.570297
10 2007-01-11 -0.280080
11 2007-01-12 -1.582950
12 2007-01-13 0.494927
13 2007-01-14 2.065250
14 2007-01-15 -2.406877
15 2007-01-16 0.124046
16 2007-01-17 -1.015604
17 2007-01-18 1.480173
18 2007-01-19 0.705919
19 2007-01-20 -2.014657
20 2007-01-21 0.130874
21 2007-01-22 -0.138736
22 2007-01-23 1.874702
23 2007-01-24 -0.170154
24 2007-01-25 -1.548015
25 2007-01-26 -0.878455
26 2007-01-27 -0.871497
27 2007-01-28 1.992482
28 2007-01-29 0.565247
29 2007-01-30 1.257662
... ... ...
2892 2014-12-02 -1.052277
2893 2014-12-03 0.123017
2894 2014-12-04 -0.970947
2895 2014-12-05 -0.821208
2896 2014-12-06 -0.027118
2897 2014-12-07 -0.100033
2898 2014-12-08 0.954733
2899 2014-12-09 0.388998
2900 2014-12-10 0.667443
2901 2014-12-11 1.580804
2902 2014-12-12 0.724011
2903 2014-12-13 -2.156507
2904 2014-12-14 0.736236
2905 2014-12-15 0.863674
2906 2014-12-16 -0.204992
2907 2014-12-17 0.976307
2908 2014-12-18 1.456367
2909 2014-12-19 -0.516854
2910 2014-12-20 -0.140291
2911 2014-12-21 1.467225
2912 2014-12-22 0.957542
2913 2014-12-23 2.061477
2914 2014-12-24 0.202104
2915 2014-12-25 0.806140
2916 2014-12-26 -0.478380
2917 2014-12-27 1.109158
2918 2014-12-28 -0.598417
2919 2014-12-29 -1.283922
2920 2014-12-30 0.546390
2921 2014-12-31 -0.640812
[2922 rows x 2 columns]
This will filter the df so that only those dates with full complement of days remain
Upvotes: 2