Reputation: 69
I'm embarrassed to ask this but after 3 days I'm banging my head off the wall. I have a dataframe with an id and three separate fields eached populated with alternative urls. I want to choose a url in order of preference (urlA,urlB,urlC) and assign the selected to a new field called chosenUrl. I have a function that sees if a given url field is populated and if so it returns that url. I wanted to run this as an apply function in pandas to make it run as fast as possible. I have done similar applies successfully but every time i run it, it returns a blank. I can't share the actual data set but I've created an anonymised version for reference.
Any idea why it keeps returning blanks? Code:
data = {'id': ['123','456','789'],
'urlA': ['www.google.com','',''],
'urlB': ['','www.amazon.com',''],
'urlC': ['','','www.yahoo.com']
}
df = pd.DataFrame(data, columns=['id','urlA','urlB','urlC'])
#function to select populated url - does the string contain a '.'?
def selectUrl(urlA,urlB,urlC):
tmp = '.'
if tmp in str(urlA):
y= urlA
elif tmp in str(urlB):
y = urlB
elif tmp in str(urlC):
y= urlC
else:
y = ''
return y
#Pandas apply statement
df['selectedURL']=df.apply(lambda x: selectUrl(x['urlA'],x['urlB'],x['urlB']), axis=1)
I would expect this to return
Any ideas? I was thinking that maybe its returning an object of the whole field which is ignored but haven't been able to figure it out. I have no doubt i'm going to facepalm after someone points something out but I will deserve it!
Any advice is appreciated
Upvotes: 0
Views: 243
Reputation: 863341
Because here is prioritize column urlA
, then urlB
and last urlC
select this columns by priority in list (first are more prioritized), then replace empty strings to missing values, back filling missing values and fill first column by position:
df['url'] = df[['urlA','urlB','urlC']].replace('',np.nan).bfill(axis=1).iloc[:, 0]
print (df)
id urlA urlB urlC url
0 123 www.google.com www.google.com
1 456 www.amazon.com www.amazon.com
2 789 www.yahoo.com www.yahoo.com
Importance of order of values of list is possible see in changed data:
data = {'id': ['123','456','789'],
'urlA': ['www.google.com','www.google.com',''],
'urlB': ['','www.amazon.com','www.amazon.com'],
'urlC': ['www.yahoo.com','','www.yahoo.com']
}
df = pd.DataFrame(data, columns=['id','urlA','urlB','urlC'])
#A,B,C priority
df['url1'] = df[['urlA','urlB','urlC']].replace('',np.nan).bfill(axis=1).iloc[:, 0]
#C,B,A priority
df['url2'] = df[['urlC','urlB','urlA']].replace('',np.nan).bfill(axis=1).iloc[:, 0]
print (df)
id urlA urlB urlC url1 \
0 123 www.google.com www.yahoo.com www.google.com
1 456 www.google.com www.amazon.com www.google.com
2 789 www.amazon.com www.yahoo.com www.amazon.com
url2
0 www.yahoo.com
1 www.amazon.com
2 www.yahoo.com
Upvotes: 1