Reputation: 133
I'm sure the answer to this is simple - I just can't see it for some reason.
I'd like to extract the URL path from a DataFrame of URLs without using a for loop - I'll be running this against 1M+ rows and loops are too slow.
import pandas as pd
from urllib.parse import urlparse
d = {'urls': ['https://www.example.com/ex/1','https://www.example.com/1/ex']}
df = pd.DataFrame(data=d)
df
df['urls'].apply(urlparse)
Above is where I'm at; this returns an object containing all the parts of the URL parsed by urllib.
The desired end result is a DataFrame like the below:
d = {'urls': ['https://www.example.com/ex/1','https://www.example.com/1/ex'], 'url_path': ['/ex/1', '/1/ex']}
If anyone knows how to solve this, I'd appreciate the help!
Thanks!
Upvotes: 0
Views: 551
Reputation: 6114
The docstring of urlparse
clearly says that its result is a named 6-tuple with these fields:
<scheme>://<netloc>/<path>;<params>?<query>#<fragment>
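For instance (a quick illustration, not part of the original answer), the fields of that named tuple can be read either by index or by attribute name - the path is field 2:

```python
from urllib.parse import urlparse

parts = urlparse('https://www.example.com/ex/1')
# The result is a ParseResult named tuple; .path and index 2 are equivalent.
print(parts.path)   # '/ex/1'
print(parts[2])     # '/ex/1'
```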
So the solution is two commands: take element 2 of the urlparse result, then pass the orient='list' arg to the to_dict DataFrame method:
df['paths'] = df['urls'].apply(lambda x: urlparse(x)[2])
df.to_dict(orient='list')
Results in
{'urls': ['https://www.example.com/ex/1', 'https://www.example.com/1/ex'],
'paths': ['/ex/1', '/1/ex']}
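Since the question mentions 1M+ rows, a possible alternative (my own sketch, not from the answer above) is to skip the per-row urlparse call entirely and strip the scheme and netloc with a vectorized pandas string operation. Note this is only an approximation of urlparse: unlike urlparse(x).path, the regex below leaves any query string or fragment attached, so it matches only for plain URLs like the ones in this question:

```python
import pandas as pd

df = pd.DataFrame({'urls': ['https://www.example.com/ex/1',
                            'https://www.example.com/1/ex']})

# Vectorized alternative: remove the leading "<scheme>://<netloc>" part.
# Caveat: keeps ";params", "?query" and "#fragment" if present, so it only
# agrees with urlparse's path field for simple URLs without those parts.
df['url_path'] = df['urls'].str.replace(r'^[a-z][a-z0-9+.-]*://[^/]*', '',
                                        regex=True)
print(df['url_path'].tolist())  # ['/ex/1', '/1/ex']
```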
Upvotes: 1