Reputation: 902
I have a relatively large data frame in Python (~10^6 records), structured as so:
Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/1/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
7,9/1/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/1/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/1/16,X,AR,728,JJ County,5.0,negative,negative,negative
...
I am trying to group by Date (day) and ID, and then count 1) the total number of records for each day and ID, and 2) the total number of "positives" in column "A" (for example) for each day and ID. Ultimately, I would like to populate a data frame that indicates the number of positives and total number of records for each day and ID, e.g.,
Date,ID,Positive,Total
9/1/16,360,10,20
9/2/16,360,12,23
9/2/16,718,2,43
...
I originally used a double for-loop that went through each unique day and ID, but this was taking too much time. I would appreciate help on a better approach. Thanks in advance for any comments!
Upvotes: 1
Views: 1884
Reputation: 7170
I took the data you provided and created a small .csv file so you can replicate... furthermore, I changed several values to test this works:
Index,Date,City,State,ID,County,Age,A,B,C
0,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,9/1/16,X,AL,360,BB County,1.0,positive,negative,negative
2,9/1/16,X,AL,360,BB County,10.0,positive,negative,negative
3,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,9/2/16,X,AR,728,JJ County,3.0,negative,negative,negative
6,9/2/16,X,AR,728,JJ County,8.0,positive,negative,negative
7,9/2/16,X,AR,728,JJ County,8.0,negative,negative,negative
8,9/3/16,X,AR,728,JJ County,14.0,negative,negative,negative
9,9/3/16,X,AR,728,JJ County,5.0,negative,negative,negative
Once you read it in, here's how things look:
>>> X = pd.read_csv('data.csv', header=0, index_col=None).drop('Index', axis=1)
>>> print(X)
Date City State ID County Age A B C
0 9/1/16 X AL 360 BB County 29.0 negative positive positive
1 9/1/16 X AL 360 BB County 1.0 positive negative negative
2 9/1/16 X AL 360 BB County 10.0 positive negative negative
3 9/1/16 X AL 360 BB County 11.0 negative negative negative
4 9/1/16 X AR 718 LL County 67.0 negative negative negative
5 9/2/16 X AR 728 JJ County 3.0 negative negative negative
6 9/2/16 X AR 728 JJ County 8.0 positive negative negative
7 9/2/16 X AR 728 JJ County 8.0 negative negative negative
8 9/3/16 X AR 728 JJ County 14.0 negative negative negative
9 9/3/16 X AR 728 JJ County 5.0 negative negative negative
This is the function that is applied to each group in the groupby
call:
def _ct_id_pos(grp):
return grp[grp.A == 'positive'].shape[0], grp.shape[0]
This will be a two-step process... Using pandas, you can groupby several columns and apply the above function.
# the following will have the tuple in one column
>>> X_prime = X.groupby(['Date', 'ID']).apply(_ct_id_pos).reset_index()
>>> print(X_prime)
Date ID 0
0 9/1/16 360 (2, 4)
1 9/1/16 718 (0, 1)
2 9/2/16 728 (1, 3)
3 9/3/16 728 (0, 2)
Notice the result of the groupby function gives us a new column with embedded tuples, so the next step is to split those out into their own respective columns and drop the embedded one:
>>> X_prime[['Positive', 'Total']] = X_prime[0].apply(pd.Series)
>>> X_prime.drop([0], axis=1, inplace=True)
>>> print(X_prime)
Date ID Positive Total
0 9/1/16 360 2 4
1 9/1/16 718 0 1
2 9/2/16 728 1 3
3 9/3/16 728 0 2
Upvotes: 3