Geekster

Reputation: 491

Getting rows from a data frame which satisfy a condition in pandas

I have a data frame and I have a range of numbers. I want to find the rows where values in a particular column lie in that range.

This seems like a trivial job. I tried with the techniques given here - http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-boolean

I took a simple example:

In [6]: df_s
Out[6]: 
   time  value
0     1      3
1     2      4
2     3      3
3     4      4
4     5      3
5     6      2
6     7      2
7     8      3
8     9      3

In [7]: df_s[df_s.time.isin(range(1,8))]
Out[7]: 
   time  value
0     1      3
1     2      4
2     3      3
3     4      4
4     5      3
5     6      2
6     7      2

Then, I tried with a sample from the data set I am working with which has timestamp and value as columns:

In [8]: df_s = pd.DataFrame({'time': range(1379945743841,1379945743850), 'value': [3,4,3,4,3,2,2,3,3]})

In [9]: df_s
Out[9]: 
            time  value
0  1379945743841      3
1  1379945743842      4
2  1379945743843      3
3  1379945743844      4
4  1379945743845      3
5  1379945743846      2
6  1379945743847      2
7  1379945743848      3
8  1379945743849      3

In [10]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[10]: 
Empty DataFrame
Columns: [time, value]
Index: []

Why doesn't the same technique work in this case? What am I doing wrong?

I tried another approach:

In [11]: df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-45c44def41b4> in <module>()
----> 1 df_s[df_s.time >= 1379945743843 and df_s.time <=1379945743845]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Then I tried a slightly more complex approach:

In [13]: df_s.ix[[idx for idx in df_s.index if df_s.ix[idx]['time'] in range(1379945743843, 1379945743845)]]
Out[13]: 
            time  value
2  1379945743843      3
3  1379945743844      4

This gives the desired result, but it is far too slow on my original data set, which has 209920 rows, and I expect the number of rows to grow when I actually put my code to the test.

Can anyone direct me towards the right approach?

I am using python 2.7.3 and pandas 0.12.0

Update:

Jeff's answer worked.

But I find the isin approach simpler, more intuitive and less cluttered. Please comment if anyone has any idea why it failed.

Thanks!

Upvotes: 0

Views: 7421

Answers (1)

Jeff

Reputation: 128948

Try this way

In [7]:  df_s = pd.DataFrame({'time': range(1379945743841,1379945743850), 'value': [3,4,3,4,3,2,2,3,3]})

Convert your ms epoch timestamps to actual times

In [8]: df_s['time'] = pd.to_datetime(df_s['time'],unit='ms')

In [9]: df_s
Out[9]: 
                        time  value
0 2013-09-23 14:15:43.841000      3
1 2013-09-23 14:15:43.842000      4
2 2013-09-23 14:15:43.843000      3
3 2013-09-23 14:15:43.844000      4
4 2013-09-23 14:15:43.845000      3
5 2013-09-23 14:15:43.846000      2
6 2013-09-23 14:15:43.847000      2
7 2013-09-23 14:15:43.848000      3
8 2013-09-23 14:15:43.849000      3

These are your converted endpoints

In [10]: pd.to_datetime(1379945743843,unit='ms')
Out[10]: Timestamp('2013-09-23 14:15:43.843000', tz=None)

In [11]: pd.to_datetime(1379945743845,unit='ms')
Out[11]: Timestamp('2013-09-23 14:15:43.845000', tz=None)

In [12]: df = df_s.set_index('time')
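The indexed frame df is not used again in this answer. As a minimal sketch of one way it could be used (my assumption, not something the answer goes on to show), label-based slicing with .loc on a sorted DatetimeIndex selects a time window and includes both endpoints:

```python
import pandas as pd

df_s = pd.DataFrame({
    "time": range(1379945743841, 1379945743850),
    "value": [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Convert ms epoch integers to timestamps, then index by time.
df_s["time"] = pd.to_datetime(df_s["time"], unit="ms")
df = df_s.set_index("time")

start = pd.to_datetime(1379945743843, unit="ms")
end = pd.to_datetime(1379945743845, unit="ms")

# Slicing by label on a sorted DatetimeIndex keeps both endpoints.
window = df.loc[start:end]
print(window)
```

Unlike the strict > and < comparisons used below, this slice keeps the rows at both boundary timestamps.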

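You must combine the conditions with & rather than and, and wrap each comparison in parentheses. Python's and asks for a single truth value from each Series, which is what raises the ValueError; & works element-wise, but it binds more tightly than the comparison operators, hence the parens.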

In [13]: df_s[(df_s.time>pd.to_datetime(1379945743843,unit='ms')) & (df_s.time<pd.to_datetime(1379945743845,unit='ms'))]
Out[13]: 
                    time  value
3 2013-09-23 14:15:43.844000      4

In 0.13 (coming soon), you will be able to do this:

In [7]: df_s.query('"2013-09-23 14:15:43.843" < time < "2013-09-23 14:15:43.845"')
Out[7]: 
                    time  value
3 2013-09-23 14:15:43.844000      4

Your isin approach DOES work. Not sure why it's not working for you.

In [11]: df_s[df_s.time.isin(range(1379945743843,1379945743845))]
Out[11]: 
            time  value
2  1379945743843      3
3  1379945743844      4
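Since the cause of the empty result is not established in this thread, the following is only a diagnostic sketch. It prints the dtype of the column and passes the candidate values as an explicit 64-bit integer array, on the unconfirmed assumption that an integer-width mismatch on some platforms or versions could be involved.

```python
import numpy as np
import pandas as pd

df_s = pd.DataFrame({
    "time": range(1379945743841, 1379945743850),
    "value": [3, 4, 3, 4, 3, 2, 2, 3, 3],
})

# Check how the timestamps are stored; these values do not fit in 32 bits.
print(df_s["time"].dtype)

# Build the lookup values explicitly as int64 (assumption: this rules out
# any width or type mismatch between the column and the candidates).
wanted = np.arange(1379945743843, 1379945743845, dtype=np.int64)
result = df_s[df_s["time"].isin(wanted)]
print(result)
```

If this returns the two expected rows while the plain range(...) call does not, the difference points to how the candidate values are typed rather than to isin itself.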

Upvotes: 4
