user820304
user820304

Reputation:

Combining multiple rows and calculating values

Consider the following MVCE:

import pandas as pd

data_in = [
  { 'company': 'A', 'time': 1552521600, 'interval': 'hour', 'violations': 0, 'handled': 400 },
  { 'company': 'A', 'time': 1552525200, 'interval': 'hour', 'violations': 2, 'handled': 300 },
  { 'company': 'A', 'time': 1552528800, 'interval': 'hour', 'violations': 0, 'handled': 20 },
  { 'company': 'A', 'time': 1552521600, 'interval': 'day', 'violations': 3, 'handled': 1800 },

  { 'company': 'B', 'time': 1552521600, 'interval': 'hour', 'violations': 1, 'handled': 200 },
  { 'company': 'B', 'time': 1552525200, 'interval': 'hour', 'violations': 1, 'handled': 200 },
  { 'company': 'B', 'time': 1552528800, 'interval': 'hour', 'violations': 2, 'handled': 400 },
  { 'company': 'B', 'time': 1552521600, 'interval': 'day', 'violations': 4, 'handled': 1400 },
]

df = pd.DataFrame(data_in).set_index(['company', 'time'])
df.sort_index(axis=0, inplace=True)

What I'm wanting to accomplish here is, for each company, to compare the number of violations in a given hour to the daily total violations, as well as the number of violations relative to the number of units handled.

One step that appears to get me closer is using a pivot table:

pv = df.pivot_table(index=['company', 'time'], columns='interval')
print(df)

results in

                   handled        violations
interval               day   hour        day hour
company time
A       1552521600  1800.0  400.0        3.0  0.0
        1552525200     NaN  300.0        NaN  2.0
        1552528800     NaN   20.0        NaN  0.0
B       1552521600  1400.0  200.0        4.0  1.0
        1552525200     NaN  200.0        NaN  1.0
        1552528800     NaN  400.0        NaN  2.0

I believe what I'm trying to get is

                   handled        violations
interval               day   hour        day hour
company time
A       1552521600  1800.0  400.0        3.0  0.0
        1552525200  1800.0  300.0        3.0  2.0
        1552528800  1800.0   20.0        3.0  0.0
B       1552521600  1400.0  200.0        4.0  1.0
        1552525200  1400.0  200.0        4.0  1.0
        1552528800  1400.0  400.0        4.0  2.0

The calculations I'm trying to do will be something along the line:

pv['hv_ratio'] = pv['handled.hour'] / pv['violations.hour']
pv['v_ratio'] = pv['violations.hour'] / pv['violations.day']

I have tried a few things over the past few hours, but come up dry on how to approach this problem.

How can I accomplish this?

Upvotes: 0

Views: 38

Answers (1)

cullzie
cullzie

Reputation: 2755

I think you just need to use ffill and then calculate your hv_ratio and v_ratio columns as below:

pv.ffill(inplace=True)
pv['hv_ratio'] = pv['handled']['hour'] / pv['violations']['hour']
pv['v_ratio'] = pv['violations']['hour'] / pv['violations']['day']

Upvotes: 1

Related Questions