
Reputation: 4060

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.

I want to determine the earliest date for which data is available in the shorter series, and remove the rows in both columns before that date.

What is the most pythonic way to do that?

(I apologize that I don't really follow the SO guideline for submitting questions)

Here is a fragment of my dataframe:

            osr       go
Date        
1990-08-17  NaN     239.75
1990-08-20  NaN     251.50
1990-08-21  352.00  265.00
1990-08-22  353.25  274.25
1990-08-23  351.75  290.25

In this case, I want to get rid of all rows before 1990-08-21. (Note that there may also be NAs in one of the columns for more recent dates.)

Upvotes: 2

Views: 1485

Answers (2)

sedavidw

Reputation: 11691

EDIT: New answer based upon comments/edits

It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.

df = df.dropna()

This answer assumes that once you are past the bad rows, the remaining rows stay good, or that you don't mind dropping rows in the middle; it depends on how sequential you need the data to be. If the data needs to stay sequential and your input is well formed, jezrael's answer is good.
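To make the caveat concrete, here is a minimal sketch that rebuilds the question's sample data and adds one hypothetical later gap in the go column (that NaN is an assumption for illustration, not part of the original data). It compares dropna() with trimming only the leading rows via first_valid_index():

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question; the NaN in 'go' on 1990-08-22
# is a hypothetical later gap added for illustration.
idx = pd.to_datetime(
    ["1990-08-17", "1990-08-20", "1990-08-21", "1990-08-22", "1990-08-23"]
)
df = pd.DataFrame(
    {
        "osr": [np.nan, np.nan, 352.00, 353.25, 351.75],
        "go": [239.75, 251.50, 265.00, np.nan, 290.25],
    },
    index=idx,
)
df.index.name = "Date"

# dropna() removes every row that contains a NaN, including the later gap.
dropped = df.dropna()

# Slicing from the first valid 'osr' date only trims the leading rows.
trimmed = df.loc[df["osr"].first_valid_index():]

print(len(dropped))  # 2
print(len(trimmed))  # 3
```

If rows with a later gap must be kept, the second form is probably closer to what the question asks for.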

Original answer

You haven't given much detail about the structure of your dataframe, so I am going to make some assumptions. I'll assume you have many columns, two of which (time_series_1 and time_series_2) are the ones you referred to in your question, that the dates are in the index, and that all of this is stored in df.

First we can find the shorter series by comparing the number of non-null values in each column (len would be the same for both, since the columns of a DataFrame share one index)

shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']

Now we want the first date for which that series has a value

remove_date = shorter_col.first_valid_index()

Now we want to remove data before that date

mask = df.index >= remove_date
df = df[mask]
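The three steps can be sketched as one small function and run on the sample data from the question. The helper name trim_to_shortest is hypothetical, and the sketch assumes the dates are in a sorted index; it picks the column with the fewest non-null values using count() and finds its first dated value with first_valid_index():

```python
import numpy as np
import pandas as pd


def trim_to_shortest(df, cols):
    """Drop rows before the first valid value of the column with the fewest values.

    Hypothetical helper for illustration; assumes a sorted date index.
    """
    # count() ignores NaN, so the smallest count marks the shortest series.
    shortest = df[cols].count().idxmin()
    # First index label at which that column holds a non-null value.
    start = df[shortest].first_valid_index()
    # Label-based slicing on a sorted index includes the start label.
    return df.loc[start:]


idx = pd.to_datetime(
    ["1990-08-17", "1990-08-20", "1990-08-21", "1990-08-22", "1990-08-23"]
)
df = pd.DataFrame(
    {
        "osr": [np.nan, np.nan, 352.00, 353.25, 351.75],
        "go": [239.75, 251.50, 265.00, 274.25, 290.25],
    },
    index=idx,
)

result = trim_to_shortest(df, ["osr", "go"])
print(result)
```

On this data the shortest column is osr, so the result starts at 1990-08-21 and keeps three rows.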

Upvotes: 2

jezrael

Reputation: 862681

You can reverse the column with df['osr'][::-1], call idxmax on its isnull() mask to find the last date with a missing value, and then take the subset of df after that date:

print(df)
#               osr      go
#Date                      
#1990-08-17     NaN  239.75
#1990-08-20     NaN  251.50
#1990-08-21  352.00  265.00
#1990-08-22  353.25  274.25
#1990-08-23  351.75  290.25

# reverse the column so the most recent dates come first
s = df['osr'][::-1]
print(s)
#Date
#1990-08-23    351.75
#1990-08-22    353.25
#1990-08-21    352.00
#1990-08-20       NaN
#1990-08-17       NaN
#Name: osr, dtype: float64

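# idxmax returns the label of the first True, i.e. the most recent missing value
maxnull = s.isnull().idxmax()
print(maxnull)
#1990-08-20 00:00:00

# keep only the rows after that date
print(df[df.index > maxnull])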
#               osr      go
#Date                      
#1990-08-21  352.00  265.00
#1990-08-22  353.25  274.25
#1990-08-23  351.75  290.25

Upvotes: 2
