Reputation: 3161
I am trying to extract the top level URLs and ignore the paths. I am using the code below:
for row in Mexico['Page URL']:
    parsed_uri = urlparse( 'http://www.one.com.mx/furl/Conteúdo Raiz/Meu' )
    Mexico['SubDomain'] = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
This script has been running for the past hour. When I ran it, it gave the following warning:
/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
This is separate from the ipykernel package so we can avoid doing imports until
I would appreciate it if anyone could advise on a quicker way, or give pointers on the method the warning suggests.
Upvotes: 2
Views: 1941
Reputation: 880957
Calling a Python function once for each row of a Series can be very slow if the Series is very long. The key to speeding this up is replacing the multiple function calls with (ideally) one vectorized function call.
When using Pandas, that means rewriting the Python function (e.g. urlparse) in terms of vectorized string functions.
Since urlparse is a fairly complicated function, rewriting urlparse would be pretty hard. However, in your case we have the advantage of knowing that all the URLs we care about begin with https:// or http://. So we don't need urlparse in its full-blown generality. We can perhaps make do with a much simpler rule: the netloc is whatever characters follow https:// or http:// until the end of the string or the next /, whichever comes first.
If that is true, then
Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)
can extract the scheme and netloc of every URL in the Series Mexico['Page URL'] without looping and without multiple urlparse function calls. This will be much faster when len(Mexico) is big.
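Before relying on the simpler rule, it may be worth checking it against urlparse on a small sample. The sketch below does that with a few hypothetical URLs (they are not from your data); with_urlparse is a helper name made up for this illustration.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical sample URLs, chosen to include one awkward case.
urls = pd.Series([
    'http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
    'https://www.one.com.mx',
    'http://example.com?x=1',  # query string directly after the host, no path
])

# Vectorized rule: everything after the scheme up to the next '/'.
fast = urls.str.extract(r'(https?://[^/]+)', expand=False)

# Slightly stricter variant that also stops at '?' and '#'.
fast_strict = urls.str.extract(r'(https?://[^/?#]+)', expand=False)

def with_urlparse(url):
    # Reference result built from urlparse, one call per URL.
    parts = urlparse(url)
    return '{}://{}'.format(parts.scheme, parts.netloc)

slow = urls.map(with_urlparse)

comparison = pd.DataFrame({'url': urls, 'regex': fast, 'urlparse': slow})
comparison['agree'] = comparison['regex'] == comparison['urlparse']
print(comparison)
```

On this sample the simple rule disagrees with urlparse only for the URL that has a query string and no path, and the stricter pattern agrees on all three. If your URLs always contain a / after the host, the simple rule should be enough.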
For example,
import pandas as pd
Mexico = pd.DataFrame({'Page URL':['http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
'https://www.one.com.mx/furl/Conteúdo Raiz/Meu']})
Mexico['SubDomain'] = Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)
print(Mexico)
yields
                                        Page URL               SubDomain
0   http://www.one.com.mx/furl/Conteúdo Raiz/Meu   http://www.one.com.mx
1  https://www.one.com.mx/furl/Conteúdo Raiz/Meu  https://www.one.com.mx
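As for the SettingWithCopyWarning: it usually appears when the DataFrame being assigned to was itself created by slicing another DataFrame. The question does not show how Mexico was built, so the following is only a sketch under that assumption; all_pages and its Country column are hypothetical names.

```python
import pandas as pd

# Hypothetical parent frame; the real source of Mexico is not shown in the question.
all_pages = pd.DataFrame({
    'Country': ['Mexico', 'Brazil', 'Mexico'],
    'Page URL': [
        'http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
        'http://www.one.com.br/furl/Meu',
        'https://www.one.com.mx/furl/Conteúdo Raiz/Meu',
    ],
})

# .copy() makes Mexico an independent DataFrame instead of a slice of
# all_pages, so adding a column to it no longer triggers the warning.
Mexico = all_pages[all_pages['Country'] == 'Mexico'].copy()

Mexico['SubDomain'] = Mexico['Page URL'].str.extract(r'(https?://[^/]+)', expand=False)
print(Mexico)
```

The parent frame is left unchanged, and the new column exists only on the copy.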
Upvotes: 3