Pad
Pad

Reputation: 911

Removing incomplete years from pandas dataframe

I want to remove incomplete years from a dataframe to analyse the complete years further. I've looked here but the question is old and unresolved. This post follows on from a previous one where some of this code came from.

Dataframe:

dates = pd.date_range('20070101',periods=3200)
df = pd.DataFrame(np.random.randn(3200), columns =list ('A'))
df['date'] = dates
df = df[['date','A']]
df.head()

        date         A
0 2007-01-01 -0.961392
1 2007-01-02  0.832136
2 2007-01-03  1.007776
3 2007-01-04 -1.144121
4 2007-01-05 -0.850436

Index by year, count the days

count = df.set_index([df['date'].dt.year, df['date'].dt.day]).count(level=0)
count

      date    A
date           
2007   365  365
2008   366  366
2009   365  365
2010   365  365
2011   365  365
2012   366  366
2013   365  365
2014   365  365
2015   278  278

Drop all the years less than 365 days:

count = count.drop(count[count.date < 365].index)

I now want to 'uncount' the data so it returns to looking like this:

        date         A
0 2007-01-01 -0.961392
1 2007-01-02  0.832136
2 2007-01-03  1.007776
3 2007-01-04 -1.144121
4 2007-01-05 -0.850436

but doesn't include 2015 as the year is too short.

ANSWER

Create a years variable

years = count[count['date'] >= 365].index

reformat the dataframe

df = df[df['date'].dt.year.isin(years)]
df

Out[11]: 
           date         A
0    2007-01-01 -0.095152
1    2007-01-02 -0.560478
2    2007-01-03  0.793423
...    ....    ....   ...
2919 2014-12-29 -1.616931
2920 2014-12-30 -3.499278
2921 2014-12-31  0.913521

Upvotes: 1

Views: 1735

Answers (1)

EdChum
EdChum

Reputation: 394003

Get the years and pass to your original df and call isin and pass dt.year:

In [93]:
years = count[count['date'] >= 365].index
df[df['date'].dt.year.isin(years)]

Out[93]:
           date         A
0    2007-01-01  0.020745
1    2007-01-02  0.030024
2    2007-01-03 -0.385793
3    2007-01-04 -0.737720
4    2007-01-05  0.089707
5    2007-01-06 -0.820141
6    2007-01-07 -0.081740
7    2007-01-08  0.233265
8    2007-01-09  1.336224
9    2007-01-10  0.570297
10   2007-01-11 -0.280080
11   2007-01-12 -1.582950
12   2007-01-13  0.494927
13   2007-01-14  2.065250
14   2007-01-15 -2.406877
15   2007-01-16  0.124046
16   2007-01-17 -1.015604
17   2007-01-18  1.480173
18   2007-01-19  0.705919
19   2007-01-20 -2.014657
20   2007-01-21  0.130874
21   2007-01-22 -0.138736
22   2007-01-23  1.874702
23   2007-01-24 -0.170154
24   2007-01-25 -1.548015
25   2007-01-26 -0.878455
26   2007-01-27 -0.871497
27   2007-01-28  1.992482
28   2007-01-29  0.565247
29   2007-01-30  1.257662
...         ...       ...
2892 2014-12-02 -1.052277
2893 2014-12-03  0.123017
2894 2014-12-04 -0.970947
2895 2014-12-05 -0.821208
2896 2014-12-06 -0.027118
2897 2014-12-07 -0.100033
2898 2014-12-08  0.954733
2899 2014-12-09  0.388998
2900 2014-12-10  0.667443
2901 2014-12-11  1.580804
2902 2014-12-12  0.724011
2903 2014-12-13 -2.156507
2904 2014-12-14  0.736236
2905 2014-12-15  0.863674
2906 2014-12-16 -0.204992
2907 2014-12-17  0.976307
2908 2014-12-18  1.456367
2909 2014-12-19 -0.516854
2910 2014-12-20 -0.140291
2911 2014-12-21  1.467225
2912 2014-12-22  0.957542
2913 2014-12-23  2.061477
2914 2014-12-24  0.202104
2915 2014-12-25  0.806140
2916 2014-12-26 -0.478380
2917 2014-12-27  1.109158
2918 2014-12-28 -0.598417
2919 2014-12-29 -1.283922
2920 2014-12-30  0.546390
2921 2014-12-31 -0.640812

[2922 rows x 2 columns]

This will filter the df so that only those dates with full complement of days remain

Upvotes: 2

Related Questions