Reputation: 909
I want to be able to make a smaller subset of a dataframe by indexing a dataframe by the required values in one column of the dataframe.
Code
import pandas as pd
import numpy as np
data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)
lo = 10
hi = 14
lo_range = df[df['Age']>=lo]
print('lo_range:', lo_range)
mid_range = df[hi>= df['Age']>=lo]
print('mid_range:', mid_range)
I found out how to do it:
mid_range = df[( df['Age']>=lo) & (df['Age']>=hi)]
Upvotes: 1
Views: 130
Reputation: 64
You can use .drop() and booleans:
import pandas as pd
data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)
lo = 10
hi = 14
lo_range = df.drop(df[(df["Age"] >= lo)].index)
print('lo_range:', lo_range)
mid_range = df.drop(df[(df["Age"] >= hi) | (df["Age"] < lo)].index)
print('mid_range:', mid_range)
high_range = df.drop(df[(df["Age"] < hi)].index)
print('high_range:', high_range)
When you print the mid_range, you then get:
print('mid_range:', mid_range)
mid_range: Name Age Pets
2 Clarke 13 2
5 fin 12 1
6 ginny 11 2
8 ian 13 3
EDIT: your solution given above (mid_range = df[( df['Age']>=lo) & (df['Age']>=hi)]
) returns
mid_range: Name Age Pets
0 Alex 15 4
4 eve 19 1
7 hal 14 1
Which I guess is the high range, therefore not working.
Upvotes: 1
Reputation: 329
Another way to do it is to use apply with lambda:
mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
When measuring the execution time of using apply with lambda and the &
operator,
I noticed that apply lambda is a bit faster!
start_time = time.time()
mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
end_time = time.time()
'mid_range:', Name Age Pets
2 Clarke 13 2
5 fin 12 1
6 ginny 11 2
7 hal 14 1
8 ian 13 3
"execution time(sec): 0.0006139278411865234"
start_time = time.time()
mid_range = df[( df['Age']>=lo) & (df['Age']<=hi)]
end_time = time.time()
'mid_range:', Name Age Pets
2 Clarke 13 2
5 fin 12 1
6 ginny 11 2
7 hal 14 1
8 ian 13 3
"execution time(sec): 0.0015518665313720703"
So it might not matter if you have only a few entries in your dataframe, but if you happen to have a few million entries in your table, it can really make a difference!
Upvotes: 1