Reputation: 26329
I have a huge dataset where I need to crunch a lot of numbers and looking 1. for an actual solution and 2. for a fast one.
I tried to simplify and transfer my problem to a real world example as good as possible, I hope it's clear. I'm pretty sure (at least I hope) it is somehow a common problem in Pandas and not a very special one.
So let's say I have two workers in a company called foo
and bar
. Sometimes they work on the same day and doing the same task over and over again. I measure the time they need to do the task (sometimes only once a day, sometimes multiple times).
What I'm looking for now, is the difference between the shortest times if both of them worked on the same day.
I try to do the wholemeal approach, so dealing with the table (and not iterating through) as much as possible.
My current strategy is: group the data by day
and name
, keep only the shortest time
per day
and name
, if the group size is 2 (which means I have data for both workers on the same day) subtract both times.
Final goal: having a Series
of the differences of the shortest times.
However, I failed to do the grouping and filtering, so now I try to create two series for both workers and then calculate the time differences.
Here is an example dataset:
from StringIO import StringIO
import pandas as pd
raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""
df = pd.read_csv(StringIO(raw_data), sep=' ')
grouped_by_day_and_name = df.groupby(['day', 'name'])
This is how the table looks like after grouping and keeping only the shortest times:
print grouped_by_day_and_name.agg({'time': min})
time
day name
1 bar 4
foo 9
2 foo 12
3 bar 3
5 bar 1
foo 8
Now I'm only interested on day 1 and 5, since those are the only days I have data for both bar
and foo
. So I'd finish if I somehow could filter the data and subtract both times in each group, so the result would be [-5, -7]
(from day 1: 4-9, day 5 1-8).
Since I was not able to filter and subtract, I'm trying to create a series for both names and subtract each one, however the indexes don't match:
foo_best_times = df[df.name == 'foo'].groupby(['day', 'name']).agg({'time': min})
bar_best_times = df[df.name == 'bar'].groupby(['day', 'name']).agg({'time': min})
After trying to subtract each one:
print foo_best_times - bar_best_times
time
day name
1 bar NaN
foo NaN
2 foo NaN
3 bar NaN
5 bar NaN
foo NaN
What I aimed for was something like this:
day time
1 -5
2 NaN
3 NaN
5 -7
How would I subtract both series by matching only the day
as index?
Is this even the right approach to do it fast?
Upvotes: 2
Views: 4031
Reputation: 863331
I think you can use pivot_table
with aggfunc=min
and then substract columns bar
and foo
:
from StringIO import StringIO
import pandas as pd
raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""
df = pd.read_csv(StringIO(raw_data), sep=' ')
print df
day name time
0 1 foo 10
1 1 foo 9
2 1 bar 4
3 2 foo 12
4 2 foo 13
5 3 bar 3
6 3 bar 5
7 5 foo 8
8 5 bar 5
9 5 foo 9
10 5 bar 1
df = df.pivot_table(index='day', columns='name', values='time', aggfunc=min)
print df
name bar foo
day
1 4 9
2 NaN 12
3 3 NaN
5 1 8
print df['bar'] - df['foo']
1 -5
2 NaN
3 NaN
5 -7
dtype: float64
Upvotes: 2
Reputation: 5409
I think what you want to do is an 'inner' join. This type of join performs the index matching you are looking for:
from StringIO import StringIO
import pandas as pd
raw_data="""day name time
1 foo 10
1 foo 9
1 bar 4
2 foo 12
2 foo 13
3 bar 3
3 bar 5
5 foo 8
5 bar 5
5 foo 9
5 bar 1
"""
df = pd.read_csv(StringIO(raw_data), sep=' ')
# Split the dataset into the two workers
foo = df.query('name == "foo"')
bar = df.query('name == "bar"')
# Find for each day the shortest working time
foo = foo.groupby('day').agg('min')
bar = bar.groupby('day').agg('min')
# Perform an inner join of the two workers, this only keeps days
# where both workers have been working
joined = foo.join(bar, how='inner', lsuffix='_foo', rsuffix='_bar')
# Compute the difference in minimum working times
diff = joined['time_bar'] - joined['time_foo']
print diff
Result:
day
1 -5
5 -7
dtype: int64
If you want NaN
's on the days where only one worker was working, you can perform an 'outer' join:
# Perform an outer join of the two workers, this only keeps days
# where both workers have been working
joined = foo.join(bar, how='outer', lsuffix='_foo', rsuffix='_bar')
# Compute the difference in minimum working times
diff = joined['time_bar'] - joined['time_foo']
print diff
Result:
day
1 -5
2 NaN
3 NaN
5 -7
dtype: float64
Upvotes: 1