Windy71
Windy71

Reputation: 909

How to index a dataframe by a column's values between two limits

I want to be able to make a smaller subset of a dataframe by indexing a dataframe by the required values in one column of the dataframe.

Code

import pandas as pd 
import numpy as np

data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)

lo = 10
hi = 14
lo_range = df[df['Age']>=lo]
print('lo_range:', lo_range)
mid_range = df[hi>= df['Age']>=lo]
print('mid_range:', mid_range)

I found out how to do it:

mid_range = df[( df['Age']>=lo) & (df['Age']>=hi)]

Upvotes: 1

Views: 130

Answers (2)

8783645
8783645

Reputation: 64

You can use .drop() and booleans:

import pandas as pd 

data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)

lo = 10
hi = 14

lo_range = df.drop(df[(df["Age"] >= lo)].index)
print('lo_range:', lo_range)

mid_range = df.drop(df[(df["Age"] >= hi) | (df["Age"] < lo)].index)
print('mid_range:', mid_range)

high_range = df.drop(df[(df["Age"] < hi)].index)
print('high_range:', high_range)

When you print the mid_range, you then get:

print('mid_range:', mid_range)
mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
8     ian   13     3

EDIT: your solution given above (mid_range = df[( df['Age']>=lo) & (df['Age']>=hi)]) returns

mid_range:    Name  Age  Pets
0  Alex   15     4
4   eve   19     1
7   hal   14     1

Which I guess is the high range, therefore not working.

Upvotes: 1

Bruck1701
Bruck1701

Reputation: 329

Another way to do it is to use apply with lambda:

mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]

When measuring the execution time of using apply with lambda and the & operator, I noticed that apply lambda is a bit faster!

start_time = time.time()
mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
end_time = time.time()

'mid_range:',      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
7     hal   14     1
8     ian   13     3

"execution time(sec): 0.0006139278411865234"


start_time = time.time()
mid_range =  df[( df['Age']>=lo) & (df['Age']<=hi)]
end_time = time.time()

'mid_range:',      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
7     hal   14     1
8     ian   13     3

"execution time(sec): 0.0015518665313720703"

So it might not matter if you have only a few entries in your dataframe, but if you happen to have a few million entries in your table, it can really make a difference!

Upvotes: 1

Related Questions