Reputation: 417
I have a dataframe that looks like this:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2016910 entries, 2009-01-02 04:51:00 to 2012-11-02 20:00:00
Freq: T
Data columns:
X1 2016910 non-null values
X2 2016910 non-null values
X3 2016910 non-null values
X4 2016910 non-null values
X5 2016910 non-null values
dtypes: float64(5)
and I would like to "filter" it by accessing only certain times across the whole range of dates. For example, I'd like to return a dataframe that contains all rows where the time is between 13:00:00 and 14:00:00, but for all of the dates. I am reading the data from a CSV file and the datetime is one column, but I could just as easily make the input CSV file contain a separate date and time. I tried the separate date and time route, and created a multiindex, but when I did, I ended up with two index columns -- one of them containing the proper date with an incorrect time instead of just a date, and the second one containing an incorrect date, and then a correct time, instead of just a time. The input data for my multiindex attempt looked like this:
20090102,04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102,05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102,05:44:00,90.1400,90.1400,90.1400,90.1400,100
20090102,05:50:00,90.0500,90.0500,90.0500,90.0500,500
20090102,05:56:00,90.1000,90.1000,90.1000,90.1000,300
20090102,05:57:00,90.1000,90.1000,90.1000,90.1000,200
which I tried to read using this code:
singledf = pd.DataFrame.from_csv("inputfile",header=None,index_col=[0,1],parse_dates=True)
which resulted in a dataframe that looks like this:
singledf.sort()
singledf
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 716244 entries, (<Timestamp: 2009-01-02 00:00:00>, <Timestamp: 2012-11-04 04:51:00>) to (<Timestamp: 2012-11-02 00:00:00>, <Timestamp: 2012-11-04 20:00:00>)
Data columns:
X2 716244 non-null values
X3 716244 non-null values
X4 716244 non-null values
X5 716244 non-null values
X6 716244 non-null values
dtypes: float64(4), int64(1)
Maybe the multiindex approach is totally wrong, but it's one thing I tried. It seems like it is stuck on using a datetime object, and wants to force the index columns to have a datetime instead of just a date or a time. My source CSV files for the my non-multiindex attempt looks like this:
20090102 04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102 05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102 05:44:00,90.1400,90.1400,90.1400,90.1400,100
20090102 05:50:00,90.0500,90.0500,90.0500,90.0500,500
20090102 05:56:00,90.1000,90.1000,90.1000,90.1000,300
I am using pandas .9. Any suggestions are appreciated!
Upvotes: 1
Views: 4510
Reputation: 69096
A regular DatetimeIndex allows to use between_time method.
In [12]: data = """\
20090102,04:51:00,89.9900,89.9900,89.9900,89.9900,100
20090102,05:36:00,90.0100,90.0100,90.0100,90.0100,200
20090102,05:44:00,90.1400,90.1400,90.1400,90.1400,100
20090102,05:50:00,90.0500,90.0500,90.0500,90.0500,500
20090102,05:56:00,90.1000,90.1000,90.1000,90.1000,300
20090102,05:57:00,90.1000,90.1000,90.1000,90.1000,200
"""
In [13]: singledf = pd.DataFrame.from_csv(StringIO(data), header=None, parse_dates=[[0,1]])
In [14]: singledf
Out[14]:
X2 X3 X4 X5 X6
X0_X1
2009-01-02 04:51:00 89.99 89.99 89.99 89.99 100
2009-01-02 05:36:00 90.01 90.01 90.01 90.01 200
2009-01-02 05:44:00 90.14 90.14 90.14 90.14 100
2009-01-02 05:50:00 90.05 90.05 90.05 90.05 500
2009-01-02 05:56:00 90.10 90.10 90.10 90.10 300
2009-01-02 05:57:00 90.10 90.10 90.10 90.10 200
In [15]: singledf.between_time('5:30:00', '5:45:00')
Out[15]:
X2 X3 X4 X5 X6
X0_X1
2009-01-02 05:36:00 90.01 90.01 90.01 90.01 200
2009-01-02 05:44:00 90.14 90.14 90.14 90.14 100
Upvotes: 2