Karl Baker

Reputation: 913

Pandas/Python: Update a subset of column A based on a subset of one or more other columns

Edit: I've revised portions of the description below to clarify what I mean by "feature" and "group", fix a typo, and include additional code I've tried.

My pandas df has 4.5 million rows and 23 columns. The table below shows a few lines from df2, which is generated from df. It shows two groups (eeskin and hduquant) and three features (failed, exit_status, and job_number):

# report by group

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6     -1     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64    -1     4.07192e+06      61     

The "rank" column value of -1 is a placeholder.

I want to update each group's rank for each feature. "Feature" in this case means each unique value in the "feature" column: failed, exit_status, and job_number. Updating the rankings for job_number, for example, means modifying the values in column "rank" only on the rows where column "feature" equals job_number. As it turns out, each of these rows also corresponds to a different value in the "group" column.

So rather than update all values in column "rank" at once, I want to do it feature by feature, where each write updates the "rank" values for all groups on a single feature.

The rank for feature "job_number" is based on the value of the "#_jobs" column (the highest number of jobs gets rank 1). For feature "failed", rank is based on the "freq" of "top_value". exit_status can remain -1 for now.
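In pseudocode terms, what I am after is one pass per feature, where each pass touches only the rows whose "feature" value matches and writes their "rank", leaving every other row alone. Roughly (illustrative only; the column names are the ones from the df2 table above, and the sort direction for "failed" is my simplification):

ranking_rules = {
    'job_number': ('#_jobs', False),  # highest number of jobs -> rank 1
    'failed':     ('freq',   True),   # based on the freq of top_value (direction simplified)
    # 'exit_status' keeps the -1 placeholder for now
}

for feature, (key_col, ascending) in ranking_rules.items():
    mask = df2['feature'] == feature
    df2.loc[mask, 'rank'] = df2.loc[mask, key_col].rank(ascending=ascending)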

The result should look like this:

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6      1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6      2     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64     2     0                63     
1    hduquant  job_number   2       64     1     4.07192e+06      61

"eeskin" ranks 1 for failed and 2 for job_number. "hduquant" ranks 2 for failed and 1 for job_number.

I am able to update the rank values for job_number with this code:

if feat == 'job_number':
    grouped = grouped.sort_values("#_jobs", ascending=False)
    grouped['rank'] = grouped.index + 1

Output:

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6      2     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64     1     4.07192e+06      61     

But when I try to update both, neither is updated:

feat = ['job_number', 'failed']

for f in feat:
    if f == 'job_number':
        grouped = grouped.sort_values("#_jobs", ascending=False)
        grouped['rank'] = grouped.index + 1
    elif f == 'failed':  #  or f == 'exit_status'
        x = len(not grouped[f] == 0)
        grouped['x'] = x
        grouped = grouped.sort_values("x", ascending=False)
        grouped['rank'] = grouped.index + 1      
        del grouped['x']

Output:

        group      feature  #_cats #_jobs  rank        top_value  freq
10   eeskin    failed       1       6     -1     100              6      
21   eeskin    exit_status  1       6     -1     0                6     
0    eeskin    job_number   1       6     -1     4.08219e+06      6      
21   hduquant  exit_status  5       64    -1     37               58     
11   hduquant  failed       2       64    -1     0                63     
1    hduquant  job_number   2       64    -1     4.07192e+06      61     

I've tried to implement the suggestion from Matt W. but so far without success:

df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

I modified his code a bit as follows but also without success:

df2.loc[df2['feature' == 'job_number'] & df2['rank']] = (df2.loc[df2['#_jobs']].rank(ascending=False))
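For reference, I believe a literal translation of his pattern onto my column names would read as follows, though I have not verified it against the full df2:

# literal translation of Matt W.'s .loc pattern to my column names:
# rank the 'job_number' rows by '#_jobs', highest number of jobs -> rank 1
mask = df2['feature'] == 'job_number'
df2.loc[mask, 'rank'] = df2.loc[mask, '#_jobs'].rank(ascending=False)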

Addendum @Matt W.

Input:

import pandas as pd

df = pd.DataFrame([['g1', 'u1', 3902779, '2018-09-27 21:38:06', '2018-10-01 07:24:38', '2018-10-01 08:00:42', 0, 0, 'single', 1, 55696, 609865728.0, 4.0, 6.0, 0, 0, 4.0, 0, 'single', 1, 0, pd.Timedelta('3 days 09:46:32'), pd.Timedelta('00:36:04')]], 
                          columns=['group', 'owner', 'job_number', 'submission_time', 'start_time', 'end_time', 'failed', 'exit_status', 'granted_pe', 'slots', 'task_number', 'maxvmem', 'h_data', 'h_rt', 'highp', 'exclusive', 'h_vmem', 'gpu', 'pe', 'slot', 'campus', 'wait_time', 'wtime'])
df = (df.astype(dtype={'group':'str', 'owner':'str', 'job_number':'int', 'submission_time':'datetime64[ns]', 'start_time':'datetime64[ns]', 'end_time':'datetime64[ns]', 'failed':'int', 'exit_status':'int', 'granted_pe':'str', 'slots':'int', 'task_number':'int', 'maxvmem':'float', 'h_data':'float', 'h_rt':'float', 'highp':'int', 'exclusive':'int', 'h_vmem':'float', 'gpu':'int', 'pe':'str', 'slot':'int', 'campus':'int', 'wait_time':'timedelta64[ns]', 'wtime':'timedelta64[ns]'}))
df

Output:

         group  owner  job_number      submission_time           start_time             end_time  failed  exit_status  granted_pe  slots  task_number       maxvmem  h_data  h_rt  highp  exclusive  h_vmem  gpu      pe  slot  campus       wait_time     wtime
      0  g1     u1     3902779     2018-09-27 21:38:06  2018-10-01 07:24:38  2018-10-01 08:00:42  0       0            single      1      55696         609865728.0  4.0     6.0   0      0          4.0     0    single  1     0      3 days 09:46:32  00:36:04  
4080243  g50    u92    4071923     2018-10-25 02:08:14  2018-10-27 01:41:58  2018-10-27 02:08:50  0       0            shared      1      119          7.654482e+08  2.5     1.5   0      1          16.0    0    shared  1     0      1 days 23:33:44  00:26:52
4080244  g50    u92    4071922     2018-10-25 02:08:11  2018-10-27 01:46:53  2018-10-27 02:08:53  0       0            shared      1      2208         1.074463e+09  2.5     1.5   0      10         24.0    0    shared  1     0      1 days 23:38:42  00:22:00

The code above produces the first row; I tacked on a couple more rows just for variety.

There are 203 groups and 699 owners. There are thousands of jobs: a "job" is defined as a unique combination of job_number, task_number, and submission_time.
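For example, the job, owner, and group counts could be computed along these lines (a sketch against the df built above, not the exact code I use):

# a "job" is a unique (job_number, task_number, submission_time) combination
n_jobs = len(df.drop_duplicates(subset=['job_number', 'task_number', 'submission_time']))

n_groups = df['group'].nunique()   # 203 in the full data
n_owners = df['owner'].nunique()   # 699 in the full data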

I want to create an overall report and one report per group, both focused on resource usage.

Components of overall report:

General stats:

Jobs:

Owners:

Groups:

Components of individual "by group" reports:

By feature (column in df):

General stats:

The group's stats:

By job:

By owner:

By ranking:

Here I want each group ranked against all other groups, from rank 1 for the highest usage or most "fails" down to rank 203 for the lowest. I will use these values to plot a graph for each group; a rough sketch of what I mean by this ranking follows below.

Rank for:
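For the "fails" case, a minimal sketch of this group-versus-group ranking might look like the following (column names are from df above; I have not run this on the full data):

# rank every group against all others by how many of its jobs failed:
# 1 = most failed jobs, 203 = fewest (ties share the lower rank number)
fails_per_group = (df['failed'] != 0).groupby(df['group']).sum()
fail_rank = fails_per_group.rank(ascending=False, method='min').astype(int)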

Upvotes: 3

Views: 689

Answers (1)

Matt W.

Reputation: 3722

You can do this more easily by using pandas .loc.

Initialize dataframe:

df = pd.DataFrame({'group':['e','e','e','h','h','h'],
                   'feature':['fail', 'exit', 'job', 'exit', 'fail', 'job'],
                   'cats':[1, 1, 1, 5, 2, 2],
                   'jobs':[1, 1, 1, 64, 64, 64],
                   'rank':[-1, -1, -1, -1, -1, -1],
                   'topvalue':[100, 0, 4, 37, 0, 3.9],
                   'freq':[1, 1, 1, 58, 63, 61]
})

We want to rank the job feature, so we isolate the rank locations using .loc; then, on the right-hand side of the assignment, we isolate the jobs column using .loc and apply the .rank() function.

Rank job feature, by jobs value:

df.loc[df.feature == 'job', 'rank'] = df.loc[df.feature == 'job', 'jobs'].rank(ascending=False)

Rank failure feature by frequency where top value is not 0:

For this one, your expected output does rank the rows where the top value is 0, which seems to go against what you said, so we'll do it two ways.

The first way filters out the 0s to start and ranks everything else; the top_value == 0 rows keep their rank of -1:

df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'rank'] = (
        df.loc[(df.feature == 'fail') & (df.topvalue != 0), 'freq']).rank(ascending=True)

The second way doesn't filter out the 0s, so those rows get ranked as well:

df.loc[df.feature == 'fail', 'rank'] = (
        df.loc[df.feature == 'fail', 'freq']).rank(ascending=True)
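If you want to eyeball the result on the toy frame:

# quick check: after the 'job' pass the h/job row (jobs=64) is rank 1 and the
# e/job row (jobs=1) is rank 2; after the unfiltered 'fail' pass both fail rows
# get a rank as well
print(df.sort_values(['feature', 'rank'])[['group', 'feature', 'rank']])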

Upvotes: 1
