Karl Baker

Reputation: 913

Pandas/Python: Update a subset of column A based on a subset of one or more other columns

Edit: I've revised portions of the description below to clarify what I mean by "feature" and "group", fix a typo, and include additional code I've tried.

My pandas df has 4.5 million rows and 23 columns. The table below shows a few lines from df2, which is generated from df. It shows two groups (eeskin and hduquant) and three features (failed, exit_status, and job_number):

# report by group

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6     -1     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64    -1     4.07192e+06      61     

The "rank" column value of -1 is a placeholder.

I want to update each group's rank for each feature. "Feature" in this case means each unique value in the "feature" column: failed, exit_status, and job_number. Updating the rankings for job_number, for example, means modifying the values in column "rank" only on the rows where column "feature" equals job_number. As it turns out, each of these rows also corresponds to a different value in the "group" column.

So rather than update all values in column "rank" at once, I want to do it feature by feature, where each write updates the "rank" values for all groups on a single feature.

The rank for feature "job_number" is based on the value of the "#_jobs" column (the highest number of jobs gets rank 1). For feature "failed", rank is based on the "freq" of "top_value". exit_status can remain -1 for now.
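In pseudocode terms, what I am after is one pass per feature, where each pass touches only the rows whose "feature" value matches and writes their "rank", leaving every other row alone. Roughly (illustrative only; the column names are the ones from the df2 table above, and the sort direction for "failed" is my simplification):

ranking_rules = {
    'job_number': ('#_jobs', False),  # highest number of jobs -> rank 1
    'failed':     ('freq',   True),   # based on the freq of top_value (direction simplified)
    # 'exit_status' keeps the -1 placeholder for now
}

for feature, (key_col, ascending) in ranking_rules.items():
    mask = df2['feature'] == feature
    df2.loc[mask, 'rank'] = df2.loc[mask, key_col].rank(ascending=ascending)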

The result should look like this:

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6      1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6      2     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64     2     0                63     
1    hduquant  job_number   2       64     1     4.07192e+06      61

"eeskin" ranks 1 for failed and 2 for job_number. "hduquant" ranks 2 for failed and 1 for job_number.

I am able to update the rank values for job_number with this code:

if feat == 'job_number':
    grouped = grouped.sort_values("#_jobs", ascending=False)
    grouped['rank'] = grouped.index + 1

Output:

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6      2     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64     1     4.07192e+06      61     

But when I try to update both, neither is updated:

feat = ['job_number', 'failed']

for f in feat:
    if f == 'job_number':
        grouped = grouped.sort_values("#_jobs", ascending=False)
        grouped['rank'] = grouped.index + 1
    elif f == 'failed':  #  or f == 'exit_status'
        x = len(not grouped[f] == 0)
        grouped['x'] = x
        grouped = grouped.sort_values("x", ascending=False)
        grouped['rank'] = grouped.index + 1      
        del grouped['x']

Output:

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6     -1     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64    -1     4.07192e+06      61     

I've tried to implement the suggestion from Matt W. but so far without success:

df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

I modified his code a bit as follows but also without success:

df2.loc[df2['feature' == 'job_number'] & df2['rank']] = (df2.loc[df2['#_jobs']].rank(ascending=False))
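For reference, I believe a literal translation of his pattern onto my column names would read as follows, though I have not verified it against the full df2:

# literal translation of Matt W.'s .loc pattern to my column names:
# rank the 'job_number' rows by '#_jobs', highest number of jobs -> rank 1
mask = df2['feature'] == 'job_number'
df2.loc[mask, 'rank'] = df2.loc[mask, '#_jobs'].rank(ascending=False)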

Addendum @Matt W.

Input:

import pandas as pd

df = pd.DataFrame([['g1', 'u1', 3902779, '2018-09-27 21:38:06', '2018-10-01 07:24:38', '2018-10-01 08:00:42', 0, 0, 'single', 1, 55696, 609865728.0, 4.0, 6.0, 0, 0, 4.0, 0, 'single', 1, 0, pd.Timedelta('3 days 09:46:32'), pd.Timedelta('00:36:04')]], 
                          columns=['group', 'owner', 'job_number', 'submission_time', 'start_time', 'end_time', 'failed', 'exit_status', 'granted_pe', 'slots', 'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive', 'h_vmem', 'gpu', 'pe', 'slot', 'campus', 'wait_time', 'wtime'])
df = (df.astype(dtype={'group':'str', 'owner':'str', 'job_number':'int', 'submission_time':'datetime64[ns]', 'start_time':'datetime64[ns]', 'end_time':'datetime64[ns]', 'failed':'int', 'exit_status':'int', 'granted_pe':'str', 'slots':'int', 'task_number':'int', 'maxvmem':'float', 'h_data':'float', 'h_rt':'float', 'highp':'int', 'exclusive':'int', 'h_vmem':'float', 'gpu':'int', 'pe':'str', 'slot':'int', 'campus':'int', 'wait_time':'timedelta64[ns]', 'wtime':'timedelta64[ns]'}))
df

Output:

         group  owner  job_number      submission_time           start_time             end_time  failed  exit_status  granted_pe  slots  task_number       maxvmem  h_data  h_rt  highp  exclusive  h_vmem  gpu      pe  slot  campus       wait_time     wtime
      0  g1     u1     3902779     2018-09-27 21:38:06  2018-10-01 07:24:38  2018-10-01 08:00:42  0       0            single      1      55696         609865728.0  4.0     6.0   0      0          4.0     0    single  1     0      3 days 09:46:32  00:36:04  
4080243  g50    u92    4071923     2018-10-25 02:08:14  2018-10-27 01:41:58  2018-10-27 02:08:50  0       0            shared      1      119          7.654482e+08  2.5     1.5   0      1          16.0    0    shared  1     0      1 days 23:33:44  00:26:52
4080244  g50    u92    4071922     2018-10-25 02:08:11  2018-10-27 01:46:53  2018-10-27 02:08:53  0       0            shared      1      2208         1.074463e+09  2.5     1.5   0      10         24.0    0    shared  1     0      1 days 23:38:42  00:22:00

The code above produces the first row; I tacked on a couple more rows just for variety.

There are 203 groups and 699 owners. There are thousands of jobs: a "job" is defined as a unique combination of job_number, task_number, and submission_time.
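For example, the job, owner, and group counts could be computed along these lines (a sketch against the df built above, not the exact code I use):

# a "job" is a unique (job_number, task_number, submission_time) combination
n_jobs = len(df.drop_duplicates(subset=['job_number', 'task_number', 'submission_time']))

n_groups = df['group'].nunique()   # 203 in the full data
n_owners = df['owner'].nunique()   # 699 in the full data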

I want to create an overall report and one report per group, both focused on resource usage.

Components of overall report:

General stats:

Jobs:

Owners:

Groups:

Components of individual "by group" reports:

By feature (column in df):

General stats:

The group's stats:

By job:

By owner:

By ranking:

Here I want each group ranked against all other groups, from rank 1 for the highest usage or most "fails" down to rank 203 for the lowest. I will use these values to plot a graph for each group; a rough sketch of what I mean by this ranking follows below.

Rank for:
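For the "fails" case, a minimal sketch of this group-versus-group ranking might look like the following (column names are from df above; I have not run this on the full data):

# rank every group against all others by how many of its jobs failed:
# 1 = most failed jobs, 203 = fewest (ties share the lower rank number)
fails_per_group = (df['failed'] != 0).groupby(df['group']).sum()
fail_rank = fails_per_group.rank(ascending=False, method='min').astype(int)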

Upvotes: 3

Views: 689

Answers (1)

Matt W.

Reputation: 3722

You can do this more easily by using pandas .loc.

Initialize dataframe:

df = pd.DataFrame({'group':['e','e','e','h','h','h'],
                   'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                   'cats':[1, 1, 1, 5, 2, 2],
                   'jobs':[1, 1, 1, 64, 64, 64],
                   'rank':[-1, -1, -1, -1, -1, -1],
                   'topvalue':[100, 0, 4, 37, 0, 3.9],
                   'freq':[1, 1, 1, 58, 63, 61]
})

We want to rank the job feature, so we isolate the rank locations using .loc; then, on the right-hand side of the assignment, we isolate the jobs column using .loc and apply the .rank() function.

Rank job feature, by jobs value:

df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

Rank failure feature by frequency where top value is not 0:

For this one, your expected output does rank the rows where the top value is 0, which seems to go against what you said, so we'll do it two ways.

The first way filters out the 0s to start and ranks everything else; the top_value == 0 rows keep their rank of -1:

df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
        df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)

The second way doesn't filter out the 0s, so those rows get ranked as well:

df.loc[df.feature == 'fail', 'rank'] = (
        df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
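If you want to eyeball the result on the toy frame:

# quick check: after the 'job' pass the h/job row (jobs=64) is rank 1 and the
# e/job row (jobs=1) is rank 2; after the unfiltered 'fail' pass both fail rows
# get a rank as well
print(df.sort_values(['feature', 'rank'])[['group', 'feature', 'rank']])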

Upvotes: 1
