Sathish Kumar
Sathish Kumar

Reputation: 33

Replace with first occurrence value for duplicate columns using pandas or python

I have data like

ca ca ca 120.00

ca cc cd 130.00

ca ca ca 135.23

ca ha ca 60.00

ca ha ca 50.00

If first 3 columns are equal then fourth column value should be the first occurrence. I want data like

ca ca ca 120.00

ca cc cd 130.00

ca ca ca 120.00

ca ha ca 60.00

ca ha ca 60.00

Please help me to solve this

Upvotes: 3

Views: 316

Answers (2)

jezrael
jezrael

Reputation: 862581

Use GroupBy.transform with GroupBy.first

Dynamic solution with selecting first 3 columns to list and processing 4th column assigned back:

df.iloc[:, 3] = df.groupby(df.columns[:3].tolist())[df.columns[3]].transform('first')
print (df)
    0   1   2      3
0  ca  ca  ca  120.0
1  ca  cc  cd  130.0
2  ca  ca  ca  120.0
3  ca  ha  ca   60.0
4  ca  ha  ca   60.0

If there are 4 columns names like a,b,c,d solution is simplier:

df['d'] = df.groupby(['a','b','c'])['d'].transform('first')
print (df)
    a   b   c      d
0  ca  ca  ca  120.0
1  ca  cc  cd  130.0
2  ca  ca  ca  120.0
3  ca  ha  ca   60.0
4  ca  ha  ca   60.0

Upvotes: 2

scsanty
scsanty

Reputation: 356

Considering this is your DataFrame:

import pandas as pd
df = pd.DataFrame([['ca', 'ca', 'ca', 120.00], ['ca', 'cc', 'cd', 130.00], ['ca', 'ca', 'ca', 135.23], ['ca', 'ha', 'ca', 60.00], ['ca', 'ha', 'ca', 50.00]], columns=['a', 'b', 'c', 'd'])

Let's construct a res_df, where only column A = Column b = Column C

res_df = df[(df['a'] == df['b']) & (df['a'] == df['c'])]

Let's get unique results from result set.

unique_vals = res_df['a'].unique()

Now let's loop these values. Step 1: getting the array of indices of the same type Step 2: get the value of column D of the first index that is get_indx[0] i.e. 0th position

df.loc[get_indx[0], 'd']

Step 3: set it to the original df from get_index[1:] i.e poition 1 till end

df.at[get_indx[i], 'd']

for value in unique_vals:
    get_indx = res_df[res_df['a'] == value].index

    if len(get_indx) > 1:
        for i in range(1, len(get_indx)):
            df.at[get_indx[i], 'd'] = df.loc[get_indx[0], 'd']

Upvotes: 0

Related Questions