TimH

Reputation: 21

Performing an Excel-like COUNTIFS in pandas

I have a dataset that lists employees and the timestamps at which they complete certain actions. It is divided into three columns: Employee, Date, Hour.

I want to count the number of employees who are active each hour. In Excel I would do this by adding a fourth column, EmpFactor, in which I perform a COUNTIFS operation:

=1/COUNTIFS(Name range;Name;Date range;Date;Hour range;Hour)

I can subsequently calculate the number of active employees by performing a SUMIF on the EmpFactor Column.

I tried the following code to compose the EmpFactor column using pandas:

for name,date,time in zip(df['Employee'],df['Date'],df['Time']):
    df['EmpFactor'] = 1/(df[(df.Employee == name) and (df.Day == dag) 
                             and (df.Time == tijd)].count())

This, however, doesn't work. I have searched through numerous topics on SO but haven't found a fitting answer yet.

Upvotes: 2

Views: 10157

Answers (2)

Mike Müller

Reputation: 85442

Starting with this dataframe:

df = pd.DataFrame({'Employee': list('ABCDEFGH'), 
                   'Date': [1, 1, 1, 2, 2, 2, 3, 3],
                   'Time': [10, 10, 10, 11, 10, 11, 11, 12]})
print(df)

Output:

   Date Employee  Time
0     1        A    10
1     1        B    10
2     1        C    10
3     2        D    11
4     2        E    10
5     2        F    11
6     3        G    11
7     3        H    12

You can group by Date and Time and count the employees:

per_hour = df.groupby(['Date', 'Time']).count()
per_hour['EmpFactor'] = 1 / per_hour.Employee
print(per_hour)

Output:

           Employee  EmpFactor
Date Time                     
1    10           3   0.333333
2    10           1   1.000000
     11           2   0.500000
3    11           1   1.000000
     12           1   1.000000
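If the same employee can appear more than once within an hour, the question's formula also keys on the name, so the count has to be taken per employee, date and hour. A minimal sketch of that variant, assuming the column names above; the sample rows are invented for illustration:

```python
import pandas as pd

# Hypothetical sample: employee A logs two actions in the same hour.
df = pd.DataFrame({
    'Employee': ['A', 'A', 'B', 'C', 'C'],
    'Date':     [1, 1, 1, 1, 2],
    'Time':     [10, 10, 10, 11, 10],
})

# Row-level equivalent of 1/COUNTIFS(Name; Date; Hour):
# each row gets 1 / (number of rows sharing its Employee, Date and Time).
keys = ['Employee', 'Date', 'Time']
df['EmpFactor'] = 1 / df.groupby(keys)['Employee'].transform('size')

# SUMIF equivalent: summing the factors per hour counts each employee once.
active = df.groupby(['Date', 'Time'])['EmpFactor'].sum()

# The same figure without the helper column: distinct employees per hour.
active_direct = df.groupby(['Date', 'Time'])['Employee'].nunique()

print(active)
print(active_direct)
```

The nunique() version is probably the simpler choice when only the number of active employees is needed; the EmpFactor column mainly helps when mirroring the spreadsheet layout.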

Upvotes: 2

Eugene Lisitsky

Reputation: 12845

Suppose your DataFrame has this structure:

import pandas as pd
import numpy as np
df = pd.DataFrame([['Alice', '2012-03-05', 23], 
                   ['Fred',  '2012-03-05', 23], 
                   ['Bob',   '2012-12-12', 00]], 
                  columns=('Employee', 'Date', 'Time'))

# Here you have:
    Employee        Date  Time
0      Alice  2012-03-05    23
1       Fred  2012-03-05    23
2        Bob  2012-12-12     0

# convert to a date
df['DateTime']=pd.to_datetime(df['Date'])
# make it index
df2=df.set_index('DateTime')
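# group by calendar day and Time
# (pd.TimeGrouper was removed from later pandas releases; pd.Grouper(freq='D') replaces it)
g = df2.groupby([pd.Grouper(freq='D'), 'Time'])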
# get counts:
print(g.count())

#Here you have:
                     Employee  Date
DateTime      Time
2012-03-05     23           2     2
2012-12-12      0           1     1


# to get inverted values:
print(1/g.count())

                   Employee  Date
DateTime     Time
2012-03-05   23         0.5   0.5
2012-12-12   0          1.0   1.0

Of course, it would be better to make Time part of the DateTime column. You can try that as an exercise if you wish :)

This approach is quite fast: it took about 3 minutes to group 47M rows on my laptop.
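As a sketch of the suggestion to fold Time into the DateTime column, one option is to add the hour as a timedelta and group on the combined timestamp. This assumes Time holds whole hours (0-23), which the answer does not state explicitly:

```python
import pandas as pd

df = pd.DataFrame([['Alice', '2012-03-05', 23],
                   ['Fred',  '2012-03-05', 23],
                   ['Bob',   '2012-12-12', 0]],
                  columns=['Employee', 'Date', 'Time'])

# Combine the date and the hour into a single timestamp
# (assumption: Time is an integer hour of the day).
df['DateTime'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Time'], unit='h')

# Distinct employees per hourly timestamp.
per_hour = df.groupby('DateTime')['Employee'].nunique()
print(per_hour)
```

With a single timestamp column there is only one grouping key, and 1 / per_hour gives the inverted values if they are still needed.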

Upvotes: 1
