Dimbo1979
Dimbo1979

Reputation: 69

Pandas: Select a value from a choice of three columns and assign as a value in a different column - with pandas apply

I'm embarrassed to ask this but after 3 days I'm banging my head off the wall. I have a dataframe with an id and three separate fields eached populated with alternative urls. I want to choose a url in order of preference (urlA,urlB,urlC) and assign the selected to a new field called chosenUrl. I have a function that sees if a given url field is populated and if so it returns that url. I wanted to run this as an apply function in pandas to make it run as fast as possible. I have done similar applies successfully but every time i run it, it returns a blank. I can't share the actual data set but I've created an anonymised version for reference.

Any idea why it keeps returning blanks? Code:

data = {'id': ['123','456','789'],
        'urlA': ['www.google.com','',''],
    'urlB': ['','www.amazon.com',''],
    'urlC': ['','','www.yahoo.com']
        }
df = pd.DataFrame(data, columns=['id','urlA','urlB','urlC'])

#function to select populated url - does the string contain a '.'?
def selectUrl(urlA,urlB,urlC):
    tmp = '.'
    if tmp in str(urlA):
        y= urlA
    elif tmp in str(urlB):
        y = urlB
    elif tmp in str(urlC):
        y= urlC
    else:
        y = ''
    return y

#Pandas apply statement
df['selectedURL']=df.apply(lambda x: selectUrl(x['urlA'],x['urlB'],x['urlB']), axis=1)

I would expect this to return

enter image description here

Any ideas? I was thinking that maybe its returning an object of the whole field which is ignored but haven't been able to figure it out. I have no doubt i'm going to facepalm after someone points something out but I will deserve it!

Any advice is appreciated

Upvotes: 0

Views: 243

Answers (1)

jezrael
jezrael

Reputation: 863341

Because here is prioritize column urlA, then urlB and last urlC select this columns by priority in list (first are more prioritized), then replace empty strings to missing values, back filling missing values and fill first column by position:

df['url'] = df[['urlA','urlB','urlC']].replace('',np.nan).bfill(axis=1).iloc[:, 0]
print (df)
    id            urlA            urlB           urlC             url
0  123  www.google.com                                 www.google.com
1  456                  www.amazon.com                 www.amazon.com
2  789                                  www.yahoo.com   www.yahoo.com

Importance of order of values of list is possible see in changed data:

data = {'id': ['123','456','789'],
        'urlA': ['www.google.com','www.google.com',''],
    'urlB': ['','www.amazon.com','www.amazon.com'],
    'urlC': ['www.yahoo.com','','www.yahoo.com']
        }
df = pd.DataFrame(data, columns=['id','urlA','urlB','urlC'])

#A,B,C priority
df['url1'] = df[['urlA','urlB','urlC']].replace('',np.nan).bfill(axis=1).iloc[:, 0]
#C,B,A priority
df['url2'] = df[['urlC','urlB','urlA']].replace('',np.nan).bfill(axis=1).iloc[:, 0]
print (df)
    id            urlA            urlB           urlC            url1  \
0  123  www.google.com                  www.yahoo.com  www.google.com   
1  456  www.google.com  www.amazon.com                 www.google.com   
2  789                  www.amazon.com  www.yahoo.com  www.amazon.com   

             url2  
0   www.yahoo.com  
1  www.amazon.com  
2   www.yahoo.com  

Upvotes: 1

Related Questions