Reputation: 17
I have a pandas dataframe that contains weight (weight column) information based on different users (user_Id column) and dates (date column/pandas data object).
I would like to calculate the weight difference between the earliest and latest measurement for all users.
To calculate the earliest and latest measurement, I use the following functions:
earliest_date = []
latest_date = []
for x in Id_list:
a = weight_info[weight_info['Id']==x]
earliest_date.append(a['date'].min())
latest_date.append(a['date'].max())
Then I want to create a for loop in order to pass in date and earliest date to get the weight information, something like:
df = weight_info[(weight_info['date']==x) & (weight_info['Id']==y)]
df['weight']
But I am not sure how to do this with a for loop based on two variables. Or is there any easier way to run the whole calculation?
Upvotes: -1
Views: 148
Reputation: 690
Get min/max dates per user with groupby
min_dates = weight_info.groupby('Id').agg({'min':'date'})
max_dates = weight_info.groupby('Id').agg({'max':'date'})
Then join with the weights to get the weight for the min/max date per user
min_weights = weight_info.merge( min_dates[['Id', 'date']],
on = ['Id', 'date'], how='inner' )
max_weights = weight_info.merge( max_dates[['Id', 'date']],
on = ['Id', 'date'], how='inner' )
Finally, subtract both for the same customer
Upvotes: 0
Reputation:
You could try using 'pandasql'. This library allows you to manipulate data in a Pandas data frame, using SQL code. I have found it useful for manipulating data frames from random csv files.
import pandasql as psql
df = 'Your_pandas_df'
# Shows the record counts in your dataset
record_count = psql.sqldf('''
SELECT
COUNT(*) as record_count
FROM df''')
Upvotes: 0