Reputation: 1500
I have a large dataframe with the following heads
import pandas as pd
f = pd.Dataframe(columns=['month', 'Family_id', 'house_value'])
Months go from 0 to 239, Family_ids up to 10900 and house values vary. So the dataframe has more than 2 and a half million lines.
I want to filter the Dataframe only for those in which there is a difference between the final house price and its initial for each family.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I thought in a code that would be something like this:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
And the above code gives an error Keyerror: False
and ValueError
Any ideas. Thanks.
Upvotes: 1
Views: 421
Reputation: 214957
Use groupby.filter
:
(f.sort_values('month')
.groupby('family_id')
.filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
# family_id house_value month
#1 1 10 0
#6 1 11 1
#8 1 11 239
As commented by @Bharath, your approach errors out because for boolean filter, it expects the boolean series to have the same length as the original data frame, which is not true in both of your cases due to the filter process you applied before the comparison.
Upvotes: 4