tamasgal
tamasgal

Reputation: 26329

Pandas: Subtracting two Series with specific index matching

I have a huge dataset where I need to crunch a lot of numbers and looking 1. for an actual solution and 2. for a fast one.

I tried to simplify and transfer my problem to a real world example as good as possible, I hope it's clear. I'm pretty sure (at least I hope) it is somehow a common problem in Pandas and not a very special one.

So let's say I have two workers in a company called foo and bar. Sometimes they work on the same day and doing the same task over and over again. I measure the time they need to do the task (sometimes only once a day, sometimes multiple times).

What I'm looking for now, is the difference between the shortest times if both of them worked on the same day.

I try to do the wholemeal approach, so dealing with the table (and not iterating through) as much as possible.

My current strategy is: group the data by day and name, keep only the shortest time per day and name, if the group size is 2 (which means I have data for both workers on the same day) subtract both times.

Final goal: having a Series of the differences of the shortest times.

However, I failed to do the grouping and filtering, so now I try to create two series for both workers and then calculate the time differences.

Here is an example dataset:

from StringIO import StringIO
import pandas as pd

raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""

df = pd.read_csv(StringIO(raw_data), sep=' ')

grouped_by_day_and_name = df.groupby(['day', 'name'])

This is how the table looks like after grouping and keeping only the shortest times:

print grouped_by_day_and_name.agg({'time': min})

          time
day name      
1   bar      4
    foo      9
2   foo     12
3   bar      3
5   bar      1
    foo      8

Now I'm only interested on day 1 and 5, since those are the only days I have data for both bar and foo. So I'd finish if I somehow could filter the data and subtract both times in each group, so the result would be [-5, -7] (from day 1: 4-9, day 5 1-8).

Since I was not able to filter and subtract, I'm trying to create a series for both names and subtract each one, however the indexes don't match:

foo_best_times = df[df.name == 'foo'].groupby(['day', 'name']).agg({'time': min})
bar_best_times = df[df.name == 'bar'].groupby(['day', 'name']).agg({'time': min})

After trying to subtract each one:

print foo_best_times - bar_best_times

          time
day name      
1   bar    NaN
    foo    NaN
2   foo    NaN
3   bar    NaN
5   bar    NaN
    foo    NaN

What I aimed for was something like this:

day time
1   -5
2   NaN
3   NaN
5   -7

How would I subtract both series by matching only the day as index?

Is this even the right approach to do it fast?

Upvotes: 2

Views: 4031

Answers (2)

jezrael
jezrael

Reputation: 863331

I think you can use pivot_table with aggfunc=min and then substract columns bar and foo:

from StringIO import StringIO 
import pandas as pd

raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""

df = pd.read_csv(StringIO(raw_data), sep=' ')
print df
    day name  time
0     1  foo    10
1     1  foo     9
2     1  bar     4
3     2  foo    12
4     2  foo    13
5     3  bar     3
6     3  bar     5
7     5  foo     8
8     5  bar     5
9     5  foo     9
10    5  bar     1
df = df.pivot_table(index='day', columns='name', values='time', aggfunc=min)

print df
name  bar  foo
day           
1       4    9
2     NaN   12
3       3  NaN
5       1    8


print df['bar'] - df['foo']
1    -5
2   NaN
3   NaN
5    -7
dtype: float64

Upvotes: 2

Marijn van Vliet
Marijn van Vliet

Reputation: 5409

I think what you want to do is an 'inner' join. This type of join performs the index matching you are looking for:

from StringIO import StringIO
import pandas as pd

raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""

df = pd.read_csv(StringIO(raw_data), sep=' ')

# Split the dataset into the two workers
foo = df.query('name == "foo"')
bar = df.query('name == "bar"')

# Find for each day the shortest working time
foo = foo.groupby('day').agg('min')
bar = bar.groupby('day').agg('min')

# Perform an inner join of the two workers, this only keeps days
# where both workers have been working
joined = foo.join(bar, how='inner', lsuffix='_foo', rsuffix='_bar')

# Compute the difference in minimum working times
diff = joined['time_bar'] - joined['time_foo']

print diff

Result:

day
1    -5
5    -7
dtype: int64

If you want NaN's on the days where only one worker was working, you can perform an 'outer' join:

# Perform an outer join of the two workers, this only keeps days
# where both workers have been working
joined = foo.join(bar, how='outer', lsuffix='_foo', rsuffix='_bar')

# Compute the difference in minimum working times
diff = joined['time_bar'] - joined['time_foo']

print diff

Result:

day
1    -5
2   NaN
3   NaN
5    -7
dtype: float64

Upvotes: 1

Related Questions