yanadm
yanadm

Reputation: 707

Calculations on several dataframes in pandas

I have several dataframes. Below is an example of each of them.

        df_min
scale   code    R1  R2 ...  
  1      121    50  30
  2      121    35  45
  3      121    40  50
  4      121    20  30
  5      121    20  35
  1      313    10   7
  2      313    13  10
  3      313    10  12
  4      313    15  8
  5      313    17  10
...

        df_rate
scale   code    R1  R2 ...
  1      121    20  40
  2      121    30  20
  3      121    20  30
  4      121    15  40
  5      121    10  30
  1      313    10   5
  2      313    15  10
  3      313    25  10
  4      313    10  15
  5      313    20   5
...

        df_max
scale   code    R1  R2 ...
  1      121    30  200
  2      121    100 175
  3      121    70  100
  4      121    80  90
  5      121    75  35
  1      313    60  70
  2      313    35  70
  3      313    50  60
  4      313    50  45
  5      313    45  68
...

     df_stock
code    R1  R2 ...
 121    100 150
 313    70  65
....

        df_new
scale   code     R1  R2 ...
  1      121    NaN NaN
  2      121    NaN NaN
  3      121    NaN NaN
  4      121    NaN NaN
  5      121    NaN NaN
  1      313    NaN NaN
  2      313    NaN NaN
  3      313    NaN NaN
  4      313    NaN NaN
  5      313    NaN NaN
...

Columns like R1 and R2 are cities and there can be many. The code column has unique product codes. They can also be many. The scale marks the number of the week for which I am making calculations. For each code, it's five weeks. Also I have an empty dataframe, in which I need to record the results of calculations.

I will give an example of a specific calculation, I hope it will be clearer. In an empty dataframe I take a cell where the scale is 1 and the code is 121 and columnn R1. I find the corresponding value in df_stock. It is equal to 100. from it I subtract the value from df_rate from the column R1, where the scale is 1. If this value is greater than the value from df_min from the column R1, where the scale is 1, I write it down, if less, then I take the value from df_max from column R1, where the scale is 1. It turned out 80, it is more than 50, so I write it down. For the next cell, I subtract from the calculated value(80) next value in the column R1, where the scale is 2. And I also check it. It turned out 50. I write it down. And so on. From the table df_stock I take the first value, from which I begin to subtract.

In fact, df_min is the minimum value and if I go below it, then I need to copy the corresponding value from df_max.

These calculations must be made for each unique code (each code is represented in the scale range from 1 to 5) and for each column like R1 and R2.

The result of the example is as follows:

scale   code    R1  R2 ...
  1      121    80  110
  2      121    50  90
  3      121    70  60
  4      121    55  90
  5      121    45  60
  1      313    60  60
  2      313    45  50
  3      313    20  40
  4      313    50  25
  5      313    30  20
...

I will be very grateful for any help!

UPD I wrote a script that does what I need. It is not optimal, but I have no other ideas. Is it possible to change it and add to it a loop? In my original data scale is between 1 and 26 and with my code I have to write a condition for each value from the scale.

My code:

import pandas as pd
import numpy as np

a = (1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
b = (121,121,121,121,121,313,313,313,313,313,444,444,444,444,444)

columns = ['scale', 'code', 'R1', 'R2', 'R3']
index = np.arange(15)

df_min = pd.DataFrame(columns=columns, index = index)
df_min['scale'] = a
df_min['code'] = b
df_min['R1'] = np.random.randint(10, 50, size=15)
df_min['R2'] = np.random.randint(10, 50, size=15)
df_min['R3'] = np.random.randint(10, 50, size=15)

df_rate = pd.DataFrame(columns=columns, index = index)
df_rate['scale'] = a
df_rate['code'] = b
df_rate['R1'] = np.random.randint(5, 40, size=15)
df_rate['R2'] = np.random.randint(5, 40, size=15)
df_rate['R3'] = np.random.randint(5, 40, size=15)

df_max = pd.DataFrame(columns=columns, index = index)
df_max['scale'] = a
df_max['code'] = b
df_max['R1'] = np.random.randint(50, 150, size=15)
df_max['R2'] = np.random.randint(50, 150, size=15)
df_max['R3'] = np.random.randint(50, 150, size=15)

index1 = np.arange(3)
df_stock = pd.DataFrame(columns=columns, index = index1)
df_stock['code'] = (121,313,444)
df_stock['R1'] = np.random.randint(100, 300, size=3)
df_stock['R2'] = np.random.randint(100, 300, size=3)
df_stock['R3'] = np.random.randint(100, 300, size=3)

df_new = pd.DataFrame(columns=columns, index = index)
df_new['scale'] = a
df_new['code'] = b

# set the index to 'code' to subtract df_rate from df_stock
df_stock = df_stock.set_index('code')
df_rate = df_rate.set_index('code')
df_new = df_stock - df_rate
# have to add back in the 'scale' column since it wasn't present in df_rate
df_new['scale'] = df_rate['scale']
# now set the index to use both 'code' and 'scale'
df_new = df_new.reset_index()
df_new = df_new.set_index(['code', 'scale'])
df_min = df_min.set_index(['code', 'scale'])
df_max = df_max.set_index(['code', 'scale'])
df_new = df_new.mask(df_new < df_min, df_max)

df_min = df_min.reset_index()
df_min.insert(2, 'test', 0)
df_max = df_max.reset_index()
df_max.insert(2, 'test', 0)
df_new = df_new.reset_index()
df_new.insert(2, 'test', 0)
df_rate = df_rate.reset_index()
df_rate.insert(2, 'test', 0)

df_new.loc[df_new['scale'].between(2,5), 'test':] = np.nan

df_rate_p = df_rate.loc[df_rate['scale'] == 2, :'scale']
df_new.index +=1
df_rate_p1 = df_new.loc[df_new['scale'] == 1, 'test':] - df_rate.loc[df_rate['scale'] == 2, 'test':]
df_new2 = pd.concat([df_rate_p, df_rate_p1], axis=1)
df_new = df_new.set_index(['code', 'scale']).fillna(df_new2.set_index(['code', 'scale'])).reset_index()
df_new = df_new.mask(df_new < df_min, df_max)

df_rate_p = df_rate.loc[df_rate['scale'] == 3, :'scale']
df_new.index +=1
df_rate_p1 = df_new.loc[df_new['scale'] == 2, 'test':] - df_rate.loc[df_rate['scale'] == 3, 'test':]
df_new2 = pd.concat([df_rate_p, df_rate_p1], axis=1)
df_new = df_new.set_index(['code', 'scale']).fillna(df_new2.set_index(['code', 'scale'])).reset_index()
df_new = df_new.mask(df_new < df_min, df_max)

df_rate_p = df_rate.loc[df_rate['scale'] == 4, :'scale']
df_new.index +=1
df_rate_p1 = df_new.loc[df_new['scale'] == 3, 'test':] - df_rate.loc[df_rate['scale'] == 4, 'test':]
df_new2 = pd.concat([df_rate_p, df_rate_p1], axis=1)
df_new = df_new.set_index(['code', 'scale']).fillna(df_new2.set_index(['code', 'scale'])).reset_index()
df_new = df_new.mask(df_new < df_min, df_max)

df_rate_p = df_rate.loc[df_rate['scale'] == 5, :'scale']
df_new.index +=1
df_rate_p1 = df_new.loc[df_new['scale'] == 4, 'test':] - df_rate.loc[df_rate['scale'] == 5, 'test':]
df_new2 = pd.concat([df_rate_p, df_rate_p1], axis=1)
df_new = df_new.set_index(['code', 'scale']).fillna(df_new2.set_index(['code', 'scale'])).reset_index()
df_new = df_new.mask(df_new < df_min, df_max)
df_new

Upvotes: 0

Views: 80

Answers (1)

Hazzles
Hazzles

Reputation: 476

using pandas index and MultiIndex is very useful for comparing the correct rows with eachother.

Here is how you would use it:

# set the index to 'code' to subtract df_rate from df_stock
df_stock = df_stock.set_index('code')
df_rate = df_rate.set_index('code')
df_new = df_stock - df_rate
# have to add back in the 'scale' column since it wasn't present in df_rate
df_new['scale'] = df_rate['scale']

# now set the index to use both 'code' and 'scale'
df_new = df_new.reset_index()
df_new = df_new.set_index(['code', 'scale'])
df_min = df_min.set_index(['code', 'scale'])
df_max = df_max.set_index(['code', 'scale'])

# you may not actually need these lines, but sometimes it is necessary!
# intersection = df_new.index.intersection(df_min.index).intersection(df_max.index)
# df_new = df_new.loc[intersection]
# df_min = df_min.loc[intersection]
# df_max = df_max.loc[intersection]

# if df_new < df_max, then use values from df_max
# I didn't actually understand what you meant to do with the
# the data, so this is probably not quite what you intended,
# but you can can use this to see how it works and implement
# your algorithm
df_new = df_new.mask(df_new < df_min, df_max)

Upvotes: 1

Related Questions