Pandas: Subtracting two Series with specific index matching

Question

I have a huge dataset where I need to crunch a lot of numbers and looking 1. for an actual solution and 2. for a fast one.

I tried to simplify and transfer my problem to a real world example as good as possible, I hope it's clear. I'm pretty sure (at least I hope) it is somehow a common problem in Pandas and not a very special one.

So let's say I have two workers in a company called foo and bar. Sometimes they work on the same day and doing the same task over and over again. I measure the time they need to do the task (sometimes only once a day, sometimes multiple times).

What I'm looking for now, is the difference between the shortest times if both of them worked on the same day.

I try to do the wholemeal approach, so dealing with the table (and not iterating through) as much as possible.

My current strategy is: group the data by day and name, keep only the shortest time per day and name, if the group size is 2 (which means I have data for both workers on the same day) subtract both times.

Final goal: having a Series of the differences of the shortest times.

However, I failed to do the grouping and filtering, so now I try to create two series for both workers and then calculate the time differences.

Here is an example dataset:

from StringIO import StringIO
import pandas as pd

raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""

df = pd.read_csv(StringIO(raw_data), sep=' ')

grouped_by_day_and_name = df.groupby(['day', 'name'])

This is how the table looks like after grouping and keeping only the shortest times:

print grouped_by_day_and_name.agg({'time': min})

          time
day name      
1   bar      4
    foo      9
2   foo     12
3   bar      3
5   bar      1
    foo      8

Now I'm only interested on day 1 and 5, since those are the only days I have data for both bar and foo. So I'd finish if I somehow could filter the data and subtract both times in each group, so the result would be [-5, -7] (from day 1: 4-9, day 5 1-8).

Since I was not able to filter and subtract, I'm trying to create a series for both names and subtract each one, however the indexes don't match:

foo_best_times = df[df.name == 'foo'].groupby(['day', 'name']).agg({'time': min})
bar_best_times = df[df.name == 'bar'].groupby(['day', 'name']).agg({'time': min})

After trying to subtract each one:

print foo_best_times - bar_best_times

          time
day name      
1   bar    NaN
    foo    NaN
2   foo    NaN
3   bar    NaN
5   bar    NaN
    foo    NaN

What I aimed for was something like this:

day time
1   -5
2   NaN
3   NaN
5   -7

How would I subtract both series by matching only the day as index?

Is this even the right approach to do it fast?

jezrael · Accepted Answer

I think you can use pivot_table with aggfunc=min and then substract columns bar and foo:

from StringIO import StringIO 
import pandas as pd

raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""

df = pd.read_csv(StringIO(raw_data), sep=' ')
print df
    day name  time
0     1  foo    10
1     1  foo     9
2     1  bar     4
3     2  foo    12
4     2  foo    13
5     3  bar     3
6     3  bar     5
7     5  foo     8
8     5  bar     5
9     5  foo     9
10    5  bar     1

df = df.pivot_table(index='day', columns='name', values='time', aggfunc=min)

print df
name  bar  foo
day           
1       4    9
2     NaN   12
3       3  NaN
5       1    8


print df['bar'] - df['foo']
1    -5
2   NaN
3   NaN
5    -7
dtype: float64

Pandas: Subtracting two Series with specific index matching

Answers (2)

Related Questions