Pandas function taking too long

Question

I am trying to extract the top level URLs and ignore the paths. I am using the code below:

for row in Mexico['Page URL']:
    parsed_uri = urlparse( 'http://www.one.com.mx/furl/Conteúdo Raiz/Meu' )
    Mexico['SubDomain'] = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

This script has been running for the past hour. When I ran it, it gave the following warning:

/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until

I would appreciate it if anyone could advise on a quicker way, perhaps with pointers on the method the warning suggests.

unutbu · Accepted Answer

Calling a Python function once for each row of a Series can be very slow if the Series is very long. The key to speeding this up is replacing the multiple function calls with (ideally) one vectorized function call.

When using Pandas, that means rewriting the Python function (e.g. urlparse) in terms of vectorized string functions.
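For contrast, here is a minimal sketch of the per-row approach written so that it actually parses each row. The loop in the question parses the same hardcoded URL on every iteration and reassigns the whole column each time. The helper name scheme_and_netloc and the standalone Series urls are made up for illustration. This version is correct but still calls urlparse once per row.

```python
from urllib.parse import urlparse

import pandas as pd

# Sample data taken from the question; `urls` is an illustrative name.
urls = pd.Series(['http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
                  'https://www.one.com.mx/furl/Conteúdo Raiz/Meu'])


def scheme_and_netloc(url):
    """Return 'scheme://netloc' for one URL (hypothetical helper)."""
    parsed = urlparse(url)
    return '{uri.scheme}://{uri.netloc}'.format(uri=parsed)


# One Python-level function call per row: correct, but slow on long Series.
per_row = urls.map(scheme_and_netloc)
print(per_row)
```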

Since urlparse is a fairly complicated function, rewriting it would be pretty hard. However, in your case we have the advantage of knowing that all the URLs we care about begin with https:// or http://, so we don't need urlparse in its full-blown generality. We can perhaps make do with a much simpler rule: the netloc is whatever characters follow https:// or http:// until the end of the string or the next /, whichever comes first. If that is true, then

Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)

can extract the scheme and netloc from every URL in the Series Mexico['Page URL'] without looping and without repeated urlparse calls. This will be much faster when len(Mexico) is large.

For example,

import pandas as pd

Mexico = pd.DataFrame({'Page URL':['http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
                                   'https://www.one.com.mx/furl/Conteúdo Raiz/Meu']})

Mexico['SubDomain'] = Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)
print(Mexico)

yields

                                        Page URL               SubDomain
0   http://www.one.com.mx/furl/Conteúdo Raiz/Meu   http://www.one.com.mx
1  https://www.one.com.mx/furl/Conteúdo Raiz/Meu  https://www.one.com.mx
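As for the SettingWithCopyWarning in the question: it usually appears when the DataFrame being assigned to was itself produced by slicing another DataFrame. The question does not show how Mexico was built, so what follows is only a sketch under that assumption. The names all_pages and Country are hypothetical. Taking an explicit .copy() of the slice makes Mexico an independent DataFrame, so adding a column to it should no longer trigger the warning.

```python
import pandas as pd

# Hypothetical parent DataFrame; the question does not show where Mexico comes from.
all_pages = pd.DataFrame({
    'Country': ['Mexico', 'Brazil', 'Mexico'],
    'Page URL': ['http://www.one.com.mx/furl/Conteúdo Raiz/Meu',
                 'http://www.example.com.br/pagina',
                 'https://www.one.com.mx/furl/Conteúdo Raiz/Meu'],
})

# .copy() makes Mexico independent of all_pages instead of a slice of it.
Mexico = all_pages[all_pages['Country'] == 'Mexico'].copy()

# Same vectorized extraction as above, now assigned to an independent frame.
Mexico['SubDomain'] = Mexico['Page URL'].str.extract('(https?://[^/]+)', expand=False)
print(Mexico)
```

The parent all_pages is left unchanged; only the copy gains the SubDomain column.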
