pyCthon
pyCthon

Reputation: 12341

Correct way of iterating over pandas dataframe by date

I want to iterate over a dataframe's major axis date by date.

Example:

tdf = df.ix[date]

The issue I am having is that the type returned by df.ix changes, leaving me with 3 possible situations

  1. If the date does not exist in tdf an error is thrown: KeyError: 1394755200000000000

  2. If there is only one item in tdf: print type(tdf) returns <class 'pandas.core.series.Series'>

  3. If there is more than one item in tdf: print type(tdf) returns <class 'pandas.core.frame.DataFrame'>

To avoid the first case I can simply wrap this in a try catch block or thanks to jxstanford, I can avoid the try catch block by using if date in df.index:

I run into the issue afterwards with an inconsistent API with a pandas series and a pandas data frame. I could solve this by checking for types but it seems I shouldn't have to do that. I would ideally like to keep the types the same. Is there a better way of doing this?

I'm running pandas 0.13.1 and I am currently loading my data from a CSV using

Here's a full example demonstrating the problem.

from pandas import DataFrame
import datetime

path_to_csv = '/home/n/Documents/port/test.csv'

df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')

start_dt = df.index.min()
end_dt = df.index.max()
dt_step = datetime.timedelta(days=1)

df.sort_index(inplace=True)

cur_dt = start_dt

while cur_dt != end_dt:
    if cur_dt in df.index:
       print type(df.ix[cur_dt])
    #run some other steps using cur_dt

    cur_dt += dt_step

An example CSV that demonstrates the problem is as follows:

value1,value2,value3,Date,type
1,2,4,03/13/14,a
2,3,3,03/21/14,b
3,4,2,03/21/14,a
4,5,1,03/27/14,b

The above code prints out

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

Is it possible to get the value of value1 from tdf in a consistent manner? or am I stuck making an if statement for and separately handle each case?

if type(df.ix[cur_dt]) == DataFrame:
    ....
if type(df.ix[cur_dt]) == Series:
    ....

Upvotes: 3

Views: 5384

Answers (3)

dmvianna
dmvianna

Reputation: 15718

This toy code will return DataFrames consistently.

def framer(rows):
    if ndim(rows) == 1:
        return rows.to_frame().T
    else:
        return rows

for cur_date in df.index:
    print type(framer(df.ix[cur_date]))

And this will give you the missing days:

df.resample(rule='D')

Have a look at the resample method docstring. It has its own options to fill up the missing data. And if you decide to make your multiple dates into a single one, the method you're looking at is groupby (if you want to combine values across rows) and drop_duplicates (if you want to ignore them). There is no need to reinvent the wheel.

Upvotes: 1

mtadd
mtadd

Reputation: 2555

You can use the apply method of the DataFrame, using axis = 1 to work on each row of the DataFrame to build a Series with the same Index.

e.g.

def calculate_value(row):
   if row.date == pd.datetime(2014,3,21):
      return 0
   elif row.type == 'a':
      return row.value1 + row.value2 + row.value3
   else:
      return row.value1 * row.value2 * row.value3

df['date'] = df.index
df['NewValue'] = df.apply(calculate_value, axis=1)

modifies your example input as follows

            value1  value2  value3 type  NewValue       date
Date
2014-03-13       1       2       4    a         7 2014-03-13
2014-03-21       2       3       3    b         0 2014-03-21
2014-03-21       3       4       2    a         0 2014-03-21
2014-03-27       4       5       1    b        20 2014-03-27

[4 rows x 6 columns]

Upvotes: 1

jxstanford
jxstanford

Reputation: 3387

Not sure what your trying to do with the dataframe, but this might be better than a try/except:

tdf = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
while cur_dt != end_dt:
    if cur_dt in df.index:
       # do your thing

    cur_dt += dt_step

Upvotes: 2

Related Questions