B Furtado
B Furtado

Reputation: 1500

Filter a pandas Dataframe based on specific month values and conditional on another column

I have a large dataframe with the following heads

import pandas as pd
f = pd.Dataframe(columns=['month', 'Family_id', 'house_value'])

Months go from 0 to 239, Family_ids up to 10900 and house values vary. So the dataframe has more than 2 and a half million lines.

I want to filter the Dataframe only for those in which there is a difference between the final house price and its initial for each family.

Some sample data would look like this:

f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})

And from that sample, the resulting dataframe would be:

g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})

So I thought in a code that would be something like this:

ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]

Also tried this:

g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]

And the above code gives an error Keyerror: False and ValueError Any ideas. Thanks.

Upvotes: 1

Views: 421

Answers (1)

akuiper
akuiper

Reputation: 214957

Use groupby.filter:

(f.sort_values('month')
  .groupby('family_id')
  .filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))

#  family_id    house_value   month
#1         1             10       0
#6         1             11       1
#8         1             11     239

As commented by @Bharath, your approach errors out because for boolean filter, it expects the boolean series to have the same length as the original data frame, which is not true in both of your cases due to the filter process you applied before the comparison.

Upvotes: 4

Related Questions