Praveen Bushipaka
Praveen Bushipaka

Reputation: 493

Filling columns based on other dataframe columns

I have two data sets

    df1 = pd.DataFrame ({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
    df2 = pd.DataFrame ({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})

I want to insert sales price and regular price in price with conditions: If df1 skuid and df2 skuid matches and df2 salesprice is not zero, use salesprice as price value. if sku's match and df2 salesprice is zero, use regularprice. if not use zero as price value.

def pric(df1,df2):
if (df1['skuid'] == df2['skuid'] and salesprice !=0): 
 price = salesprice 
elif (df1['skuid'] == df2['skuid'] and regularprice !=0):
 price = regularprice
else:
 price = 0

I made a function with similar conditions but its not working. the result should look like in df1

skuid  price
  A      10
  B      10
  C      0
  D      30

Thanks.

Upvotes: 1

Views: 92

Answers (1)

Josh Purtell
Josh Purtell

Reputation: 364

So there are a number of issues with the function given above. Here are a few in no particular order:

  1. Indentation in python matters https://docs.python.org/2.0/ref/indentation.html
  2. Vectorized functions versus loops. The function you give looks vaguely like it expects to be applied on a vectorized basis, but python doesn't work like that. You need to loop through the rows you want to look at (https://wiki.python.org/moin/ForLoop). While there is support for column transformations in python (which work without loops), they need to be invoked specifically (here's some documentation for one instance of such functionality https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).
  3. Relatedly, accessing dataframe elements and indexing Indexing Pandas data frames: integer rows, named columns
  4. Return: if you want your python function to give you a result, you should have it return the value. Not all programming languages require this (julia), but in python you should/must.
  5. Generality. This isn't strictly necessary in a one-off application, but your function is vulnerable to breaking if you change, for example, the column names in the dataframe. It is better practice to allow the user to give the relevant names in the input, for this reason and for simple flexibility.

Here is a version of your function which was more or less minimally change to fix the above specific issues

import pandas as pd

df1 = pd.DataFrame({"skuid" :("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame({"skuid" :("A","B","C","D"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})


def pric(df1, df2, id_colname,df1_price_colname, df2_salesprice_colname,df2_regularprice_colname):
    for i in range(df1.shape[0]):
        for j in range(df2.shape[0]):
            if (df1.loc[df1.index[i],id_colname] == df2.loc[df2.index[j],id_colname] and df2.loc[df2.index[j],df2_salesprice_colname] != 0):
             df1.loc[df1.index[i],df1_price_colname] = df2.loc[df2.index[j],df2_salesprice_colname]
             break
            elif (df1.loc[df1.index[i],id_colname] == df2.loc[df2.index[j],id_colname] and df2.loc[df2.index[j],df2_regularprice_colname] != 0):
             df1.loc[df1.index[i],df1_price_colname]  = df2.loc[df2.index[j],df2_regularprice_colname]
             break
    return df1

for which entering


df1_imputed=pric(df1,df2,'skuid','price','salesprice','regularprice')
print(df1_imputed['price'])

gives

0    10
1    10
2     0
3    30
Name: price, dtype: int64

Notice how the function loops through row indices before checking equality conditions on specific elements given by a row-index / column pair.

A few things to consider:

  • Why does the code loop through df1 "above" the loop through df2? Relatedly, what purpose does the break condition serve?
  • Why was the else condition omitted?
  • What is 'df1.loc[df1.index[i],id_colname]' all about? (hint: check one of the above links)

Upvotes: 1

Related Questions