Mario W
Mario W

Reputation: 17

Is it possible to create a for loop for a pandas df with 2 variables?

I have a pandas dataframe that contains weight (weight column) information based on different users (user_Id column) and dates (date column/pandas data object).

I would like to calculate the weight difference between the earliest and latest measurement for all users.

To calculate the earliest and latest measurement, I use the following functions:

earliest_date = []
latest_date = []
for x in Id_list:
    a = weight_info[weight_info['Id']==x]
    earliest_date.append(a['date'].min())
    latest_date.append(a['date'].max())

Then I want to create a for loop in order to pass in date and earliest date to get the weight information, something like:

df = weight_info[(weight_info['date']==x) & (weight_info['Id']==y)]
df['weight']

But I am not sure how to do this with a for loop based on two variables. Or is there any easier way to run the whole calculation?

Upvotes: -1

Views: 148

Answers (2)

simon
simon

Reputation: 690

Get min/max dates per user with groupby

min_dates = weight_info.groupby('Id').agg({'min':'date'})
max_dates = weight_info.groupby('Id').agg({'max':'date'})

Then join with the weights to get the weight for the min/max date per user

min_weights = weight_info.merge( min_dates[['Id', 'date']], 
                                 on = ['Id', 'date'], how='inner' )

max_weights = weight_info.merge( max_dates[['Id', 'date']], 
                                 on = ['Id', 'date'], how='inner' )

Finally, subtract both for the same customer

Upvotes: 0

user12217470
user12217470

Reputation:

You could try using 'pandasql'. This library allows you to manipulate data in a Pandas data frame, using SQL code. I have found it useful for manipulating data frames from random csv files.

import pandasql as psql

df = 'Your_pandas_df'

# Shows the record counts in your dataset
record_count = psql.sqldf('''
SELECT
COUNT(*) as record_count
FROM df''')

Upvotes: 0

Related Questions