vesuvius
vesuvius

Reputation: 435

Compare corresponding columns with each other and store the result in a new column

I had a data which I pivoted using pivot table method , now the data looks like this:

rule_id  a   b   c
50211    8   0   0
50249    16  0   3
50378    0   2   0
50402    12  9   6

I have set 'rule_id' as index. Now I compared one column to it's corresponding column and created another column with it's result. The idea is if the first column has a value other than 0 and the second column , to which the first column is compared to ,has 0 , then 100 should be updated in the newly created column, but if the situation is vice-versa then 'Null' should be updated. If both column have 0 , then also 'Null' should be updated. If the last column has value 0 , then 'Null' should be updated and other than 0 , then 100 should be updated. But if both the columns have values other than 0(like in the last row of my data) , then the comparison should be like this for column a and b:

value_of_b/value_of_a *50 + 50

and for column b and c:

value_of_c/value_of_b *25 + 25

and similarly if there are more columns ,then the multiplication and addition value should be 12.5 and so on.

I was able to achieve all the above things apart from the last result which is the division and multiplication stuff. I used this code:

m = df.eq(df.shift(-1, axis=1))

arr = np.select([df ==0, m], [np.nan, df], 1*100)

df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')

df3 = df.join(df2)

df is the dataframe which stores my pivoted table data which I mentioned at the start. After using this code my data looks like this:

   rule_id   a   b   c  comp1 comp2 comp3
    50211    8   0   0   100   NaN   NaN
    50249    16  0   3   100   NaN   100
    50378    0   2   0   NaN   100   NaN
    50402    12  9   6   100   100   100

But I want the data to look like this:

   rule_id   a   b   c  comp1 comp2 comp3
    50211    8   0   0   100   NaN   NaN
    50249    16  0   3   100   NaN   100
    50378    0   2   0   NaN   100   NaN
    50402    12  9   6   87.5  41.67 100

If you guys can help me get the desired data , I would greatly appreciate it.

Edit: This is how my data looks:

enter image description here

Upvotes: 1

Views: 68

Answers (2)

Serge Ballesta
Serge Ballesta

Reputation: 148890

The problem is that the coefficient to use when building the new compx column does not depend only on the columns position. In fact in each row it is reset to its maximum of 50 after each 0 value and is half of previous one after a non 0 value. Those resetable series are hard to vectorize in pandas, especially in rows. Here I would build a companion dataframe holding only those coefficients, and use directly the numpy underlying arrays to compute them as efficiently as possible. Code could be:

# transpose the dataframe to process columns instead of rows
coeff = df.T

# compute the coefficients
for name, s in coeff.items():
    top = 100              # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:         # reset to 100 on a 0 value
            top=100
        else:
            top = top/2    # else half the previous value
        r.append(top)
    coeff.loc[:, name] = r # set the whole column in one operation

# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2]/df[col1]*coeff[col1]
                                               +coeff[col1]))

old = df.columns[0]          # store name of first column

# Ok, enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col                # keep current column name for next iteration

# special processing for last comp column
df['comp{}'.format(i+1)] = np.where(df[col] == 0, np.nan, 100)

With this initial dataframe:

date     2019-04-25 15:08:23  2019-04-25 16:14:14  2019-04-25 16:29:05  2019-04-25 16:36:32
rule_id
50402                      0                    0                    9                    0
51121                      0                    1                    0                    0
51147                      0                    1                    0                    0
51183                      2                    0                    0                    0
51283                      0                   12                    9                    6
51684                      0                    1                    0                    0
52035                      0                    4                    3                    2

it gives as expected:

date     2019-04-25 15:08:23  2019-04-25 16:14:14  2019-04-25 16:29:05  2019-04-25 16:36:32  comp1  comp2       comp3  comp4
rule_id
50402                      0                    0                    9                    0    NaN    NaN  100.000000    NaN
51121                      0                    1                    0                    0    NaN  100.0         NaN    NaN
51147                      0                    1                    0                    0    NaN  100.0         NaN    NaN
51183                      2                    0                    0                    0  100.0    NaN         NaN    NaN
51283                      0                   12                    9                    6    NaN   87.5   41.666667  100.0
51684                      0                    1                    0                    0    NaN  100.0         NaN    NaN
52035                      0                    4                    3                    2    NaN   87.5   41.666667  100.0

Upvotes: 1

Vikas Gautam
Vikas Gautam

Reputation: 441

Ok, I think you can iterate over your dataframe df and use some if-else to get the desired output.

for i in range(len(df.index)):
    if df.iloc[i,1]!=0 and df.iloc[i,2]==0:          # column start from index 0
        df.loc[i,'colname'] = 'whatever you want'    # so rule_id is column 0
    elif:                                                
    .
    .
    .

Upvotes: 0

Related Questions