Reputation:
I have the following MVCE:
import pandas as pd
data_in = [
{ 'foo': 'company A', 'bar': 'division 1', 'time': 1, 'diff': 0.99 },
{ 'foo': 'company A', 'bar': 'division 1', 'time': 2, 'diff': 0.95 },
{ 'foo': 'company A', 'bar': 'division 1', 'time': 3, 'diff': 0.94 },
{ 'foo': 'company A', 'bar': 'division 1', 'time': 4, 'diff': 0.90 },
{ 'foo': 'company A', 'bar': 'division 1', 'time': 5, 'diff': 1.01 },
{ 'foo': 'company A', 'bar': 'division 2', 'time': 1, 'diff': 0.91 },
{ 'foo': 'company A', 'bar': 'division 2', 'time': 2, 'diff': 0.92 },
{ 'foo': 'company A', 'bar': 'division 2', 'time': 3, 'diff': 0.93 },
{ 'foo': 'company A', 'bar': 'division 2', 'time': 4, 'diff': 0.94 },
{ 'foo': 'company A', 'bar': 'division 2', 'time': 5, 'diff': 0.95 },
{ 'foo': 'company B', 'bar': 'division 1', 'time': 1, 'diff': 1.01 },
{ 'foo': 'company B', 'bar': 'division 1', 'time': 2, 'diff': 1.08 },
{ 'foo': 'company B', 'bar': 'division 1', 'time': 3, 'diff': 1.21 },
{ 'foo': 'company B', 'bar': 'division 1', 'time': 4, 'diff': 1.22 },
{ 'foo': 'company B', 'bar': 'division 1', 'time': 5, 'diff': 1.18 },
{ 'foo': 'company B', 'bar': 'division 2', 'time': 1, 'diff': 0.81 },
{ 'foo': 'company B', 'bar': 'division 2', 'time': 2, 'diff': 0.82 },
{ 'foo': 'company B', 'bar': 'division 2', 'time': 3, 'diff': 0.88 },
{ 'foo': 'company B', 'bar': 'division 2', 'time': 4, 'diff': 0.87 },
{ 'foo': 'company B', 'bar': 'division 2', 'time': 5, 'diff': 0.87 },
]
df = pd.DataFrame(data_in).set_index(['foo', 'bar', 'time'])
df.sort_index(axis=0, inplace=True)
data_out = []
for name, group in df.groupby(['foo', 'time']):
print(group)
# example output
# foo bar time
# company B division 1 5 1.18
# division 2 5 0.87
result = '?'
data_out.append({ 'foo': name[0], 'time': name[1], 'result': result })
print('out', data_out)
Essentially, I am trying to calculate some result for each company by comparing two divisions, for each time we recorded diff
.
For the example, I am trying to get a truthy result for when "division 1" performed above a target metric, and when "division 2" performed under the standard target.
One possible solution I've found would be
for name, group in df.groupby(['foo', 'time']):
group = group.reset_index()
group.loc[(group['bar'] =="division 1") & (group['diff'] > 1.04), 'result'] = True
group.loc[(group['bar'] =="division 2") & (group['diff'] < 1), 'result'] = True
group['result'] = group['result'].fillna(False)
result = group['result'].all(skipna=False)
However there could be hundreds of data points for each result, and I feel as though this solution will quickly bloat up the dataframe with hundreds or thousands of additional columns.
I may need to directly compare the diff
between two rows (ie, if "division 1" is less than "division 2"), which I cannot figure out how to do with the above solution.
Due to the amount of data being processed, I'm concerned primarily about speed but also want to avoid any unnecessary memory usage.
What is the best approach to these type of calculations?
Upvotes: 1
Views: 49
Reputation: 149075
Pivoting could be a nice approach here:
df.pivot_table(index=['foo', 'time'], columns='bar', values='diff')
gives:
bar division 1 division 2
foo time
company A 1 0.99 0.91
2 0.95 0.92
3 0.94 0.93
4 0.90 0.94
5 1.01 0.95
company B 1 1.01 0.81
2 1.08 0.82
3 1.21 0.88
4 1.22 0.87
5 1.18 0.87
You now have the values for the divisions in columns which will make comparisons easier.
I could also try unstack
. It is less versatile but far more efficient when it can be used. In my tests, it is more than 7 times faster:
df.unstack(1)
diff
bar division 1 division 2
foo time
company A 1 0.99 0.91
2 0.95 0.92
3 0.94 0.93
4 0.90 0.94
5 1.01 0.95
company B 1 1.01 0.81
2 1.08 0.82
3 1.21 0.88
4 1.22 0.87
5 1.18 0.87
Upvotes: 1