Python/Pandas: Grouping and counting records by date and ID

Question

I have a relatively large data frame in Python (~10^6 records), structured as so:

Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/1/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
7,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/1/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/1/16,X,AR,728,JJ County,5.0,negative,negative,negative
...

I am trying to group by Date (day) and ID, and then count 1) the total number of records for each day and ID, and 2) the total number of "positives" in column "A" (for example) for each day and ID. Ultimately, I would like to populate a data frame that indicates the number of positives and total number of records for each day and ID, e.g.,

Date,ID,Positive,Total
9/1/16,360,10,20
9/2/16,360,12,23
9/2/16,718,2,43
...

I originally used a double for-loop that went through each unique day and ID, but this was taking too much time. I would appreciate help on a better approach. Thanks in advance for any comments!

TayTay · Accepted Answer

I took the data you provided and created a small .csv file so you can replicate... furthermore, I changed several values to test this works:

Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,positive,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,positive,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/2/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/2/16,X,AR,728,JJ County,8.0,positive,negative,negative
7,9/2/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/3/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/3/16,X,AR,728,JJ County,5.0,negative,negative,negative

Once you read it in, here's how things look:

>>> X = pd.read_csv('data.csv', header=0, index_col=None).drop('Index', axis=1)
>>> print(X)

     Date City State   ID     County   Age         A         B         C
0  9/1/16    X    AL  360  BB County  29.0  negative  positive  positive
1  9/1/16    X    AL  360  BB County   1.0  positive  negative  negative
2  9/1/16    X    AL  360  BB County  10.0  positive  negative  negative
3  9/1/16    X    AL  360  BB County  11.0  negative  negative  negative
4  9/1/16    X    AR  718  LL County  67.0  negative  negative  negative
5  9/2/16    X    AR  728  JJ County   3.0  negative  negative  negative
6  9/2/16    X    AR  728  JJ County   8.0  positive  negative  negative
7  9/2/16    X    AR  728  JJ County   8.0  negative  negative  negative
8  9/3/16    X    AR  728  JJ County  14.0  negative  negative  negative
9  9/3/16    X    AR  728  JJ County   5.0  negative  negative  negative

This is the function that is applied to each group in the groupby call:

def _ct_id_pos(grp):
    return grp[grp.A == 'positive'].shape[0], grp.shape[0]

This will be a two-step process... Using pandas, you can groupby several columns and apply the above function.

# the following will have the tuple in one column
>>> X_prime = X.groupby(['Date', 'ID']).apply(_ct_id_pos).reset_index()
>>> print(X_prime)
     Date   ID       0
0  9/1/16  360  (2, 4)
1  9/1/16  718  (0, 1)
2  9/2/16  728  (1, 3)
3  9/3/16  728  (0, 2)

Notice the result of the groupby function gives us a new column with embedded tuples, so the next step is to split those out into their own respective columns and drop the embedded one:

>>> X_prime[['Positive', 'Total']] = X_prime[0].apply(pd.Series)
>>> X_prime.drop([0], axis=1, inplace=True)
>>> print(X_prime)
     Date   ID  Positive  Total
0  9/1/16  360         2      4
1  9/1/16  718         0      1
2  9/2/16  728         1      3
3  9/3/16  728         0      2

Python/Pandas: Grouping and counting records by date and ID

Answers (1)

Related Questions