Aongoose
Aongoose

Reputation: 45

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results where I have a columns to represent the date a test was completed and number of hours it took to complete the test i.e.

df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]}) 

I have a months worth of test data and the tests can take anywhere from a couple of hours to a couple of days. I want to try and work out, for each day, what percentage of tests fall within the ranges of 24hrs/48hrs/72hrs ect. to complete, up to the percentage of tests that took longer than a week.

I've been able to work it out generally without taking the dates into account like so:

Lab_tests['one-day'] = Lab_tests['hours'].between(0,24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] >168

one = Lab_tests['1-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]

print(one/10407 * 100)
print(two/10407 * 100)
print(eight/10407 * 100)

Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the data ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?

This question, Counting qualitative values based on the date range in Pandas is quite similar but the fact that I'm counting the occurrences in specified ranges is throwing me off and I haven't been able to get a solution out of it.

Bonus Question

I'm sure you've noticed my current code is not the most elegant thing in the world, is the a cleaner way to do what I've done above, as I'm doing that for every data range that I want?

Edit: So the Output for the sample data given would look like so:

df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])

Upvotes: 1

Views: 284

Answers (1)

Dave
Dave

Reputation: 2049

You're almost there. You just need to do a few final steps:

First, cast your bools to ints, so that you can sum them.

Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)

    Completed  hours  one-day  two-day  GreaterThanWeek
0  21/03/2020     23        1        0                0
1  22/03/2020     32        0        1                0
2  21/03/2020      8        1        0                0
3  24/03/2020     73        0        0                0
4  24/03/2020     41        0        1                0

Then, drop the hours column and roll the rest up to the level of Completed:

Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)

Lab_tests.drop('hours', axis=1).groupby('Completed').sum()

            one-day  two-day  GreaterThanWeek
Completed                                    
21/03/2020        2        0                0
22/03/2020        0        1                0
24/03/2020        0        1                0

EDIT: To get to percent, you just need to divide each column by the sum of all three. You can sum columns by defining the axis of the sum:

...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)                                                                                                                                                                           

Completed
21/03/2020    2
22/03/2020    1
24/03/2020    1
dtype: int64

Then divide the daily totals dataframe by the column-wise sum of the daily totals (again, we use axis to define whether each value of the series will be the divisor for a row or a column.):

daily_totals.div(daily_totals.sum(axis=1), axis=0)                                                                                                                                           

            one-day  two-day  GreaterThanWeek
Completed                                    
21/03/2020      1.0      0.0              0.0
22/03/2020      0.0      1.0              0.0
24/03/2020      0.0      1.0              0.0

Upvotes: 1

Related Questions