user820304
user820304

Reputation:

calculating result across multiple unique rows

I have the following MVCE:

import pandas as pd

data_in = [
  { 'foo': 'company A', 'bar': 'division 1', 'time': 1, 'diff': 0.99 },
  { 'foo': 'company A', 'bar': 'division 1', 'time': 2, 'diff': 0.95 },
  { 'foo': 'company A', 'bar': 'division 1', 'time': 3, 'diff': 0.94 },
  { 'foo': 'company A', 'bar': 'division 1', 'time': 4, 'diff': 0.90 },
  { 'foo': 'company A', 'bar': 'division 1', 'time': 5, 'diff': 1.01 },

  { 'foo': 'company A', 'bar': 'division 2', 'time': 1, 'diff': 0.91 },
  { 'foo': 'company A', 'bar': 'division 2', 'time': 2, 'diff': 0.92 },
  { 'foo': 'company A', 'bar': 'division 2', 'time': 3, 'diff': 0.93 },
  { 'foo': 'company A', 'bar': 'division 2', 'time': 4, 'diff': 0.94 },
  { 'foo': 'company A', 'bar': 'division 2', 'time': 5, 'diff': 0.95 },

  { 'foo': 'company B', 'bar': 'division 1', 'time': 1, 'diff': 1.01 },
  { 'foo': 'company B', 'bar': 'division 1', 'time': 2, 'diff': 1.08 },
  { 'foo': 'company B', 'bar': 'division 1', 'time': 3, 'diff': 1.21 },
  { 'foo': 'company B', 'bar': 'division 1', 'time': 4, 'diff': 1.22 },
  { 'foo': 'company B', 'bar': 'division 1', 'time': 5, 'diff': 1.18 },

  { 'foo': 'company B', 'bar': 'division 2', 'time': 1, 'diff': 0.81 },
  { 'foo': 'company B', 'bar': 'division 2', 'time': 2, 'diff': 0.82 },
  { 'foo': 'company B', 'bar': 'division 2', 'time': 3, 'diff': 0.88 },
  { 'foo': 'company B', 'bar': 'division 2', 'time': 4, 'diff': 0.87 },
  { 'foo': 'company B', 'bar': 'division 2', 'time': 5, 'diff': 0.87 },
]

df = pd.DataFrame(data_in).set_index(['foo', 'bar', 'time'])
df.sort_index(axis=0, inplace=True)

data_out = []
for name, group in df.groupby(['foo', 'time']):

  print(group) 
# example output
# foo       bar        time
# company B division 1 5     1.18
#           division 2 5     0.87
  result = '?'
  data_out.append({ 'foo': name[0], 'time': name[1], 'result': result })

print('out', data_out)

Essentially, I am trying to calculate some result for each company by comparing two divisions, for each time we recorded diff.

For the example, I am trying to get a truthy result for when "division 1" performed above a target metric, and when "division 2" performed under the standard target.

One possible solution I've found would be

for name, group in df.groupby(['foo', 'time']):
  group = group.reset_index()

  group.loc[(group['bar'] =="division 1") & (group['diff'] > 1.04), 'result'] = True
  group.loc[(group['bar'] =="division 2") & (group['diff'] < 1), 'result'] = True
  group['result'] = group['result'].fillna(False)
  result = group['result'].all(skipna=False)

However there could be hundreds of data points for each result, and I feel as though this solution will quickly bloat up the dataframe with hundreds or thousands of additional columns.

I may need to directly compare the diff between two rows (ie, if "division 1" is less than "division 2"), which I cannot figure out how to do with the above solution.

Due to the amount of data being processed, I'm concerned primarily about speed but also want to avoid any unnecessary memory usage.

What is the best approach to these type of calculations?

Upvotes: 1

Views: 49

Answers (1)

Serge Ballesta
Serge Ballesta

Reputation: 149075

Pivoting could be a nice approach here:

df.pivot_table(index=['foo', 'time'], columns='bar', values='diff')

gives:

bar             division 1  division 2
foo       time                        
company A 1           0.99        0.91
          2           0.95        0.92
          3           0.94        0.93
          4           0.90        0.94
          5           1.01        0.95
company B 1           1.01        0.81
          2           1.08        0.82
          3           1.21        0.88
          4           1.22        0.87
          5           1.18        0.87

You now have the values for the divisions in columns which will make comparisons easier.


I could also try unstack. It is less versatile but far more efficient when it can be used. In my tests, it is more than 7 times faster:

df.unstack(1)
                     diff           
bar            division 1 division 2
foo       time                      
company A 1          0.99       0.91
          2          0.95       0.92
          3          0.94       0.93
          4          0.90       0.94
          5          1.01       0.95
company B 1          1.01       0.81
          2          1.08       0.82
          3          1.21       0.88
          4          1.22       0.87
          5          1.18       0.87

Upvotes: 1

Related Questions