kingmakerking
kingmakerking

Reputation: 2089

Cleaning up URL column in pandas dataframe

I have the csv (or the dataframe) with the content as follows:

date          | URLs                                         | Count
-----------------------------------------------------------------------
17-mar-2014   | www.example.com/abcdef&=randstring           | 20
10-mar-2016   | www.example.com/xyzabc                       | 12
14-apr-2015   | www.example.com/abcdef                       | 11
12-mar-2016   | www.example.com/abcdef/randstring            | 30
15-mar-2016   | www.example.com/abcdef                       | 10
17-feb-2016   | www.example.com/xyzabc&=randstring           | 15
17-mar-2016   | www.example.com/abcdef&=someotherrandstring  | 12

I want to clean up the column 'URLs' where I want to convert www.example.com/abcdef&=randstring or www.example.com/abcdef/randstring to just www.example.com/abcdef, and so on, for all the rows.

I tried to play around with urlparse library and parse the URLs to combile just the urlparse(url).netloc along with urlparse(url).path/query/params. But, it tuned out to be inefficient as every URL leads to completely different path/query/params.

Is there any work around for this using pandas? Any hints/ suggestions are much appreciated.

Upvotes: 3

Views: 2720

Answers (2)

jezrael
jezrael

Reputation: 863791

I think you can use extract by regex - filter all string created by a-z and A-Z between www and .com and also another string starts with /:

print (df.URLs.str.extract('(www.[a-zA-Z]*.com/[a-zA-Z]*)', expand=False))
0    www.example.com/abcdef
1    www.example.com/xyzabc
2    www.example.com/abcdef
3    www.example.com/abcdef
4    www.example.com/abcdef
5    www.example.com/xyzabc
6    www.example.com/abcdef
Name: URLs, dtype: object

Upvotes: 2

linpingta
linpingta

Reputation: 2630

I think it's related with regex more than pandas, try to use pandas.apply to change one column.

import pandas as pd
import re

def clear_url(origin_url):
    p = re.compile('(www.example.com/[a-zA-Z]*)')
    r = p.search(origin_url)
    if r:
        return r.groups(1)[0]
    else:
        return origin_url


d = [
    {'id':1, 'url':'www.example.com/abcdef&=randstring'},
    {'id':2, 'url':'www.example.com/abcdef'},
    {'id':3, 'url':'www.example.com/xyzabc&=randstring'}
]
df = pd.DataFrame(d)

print 'origin_df'
print df

df['url'] = df['url'].apply(clear_url)
print 'new_df'
print df

Output:

origin_df
id                                 url
0   1  www.example.com/abcdef&=randstring
1   2              www.example.com/abcdef
2   3  www.example.com/xyzabc&=randstring
new_df
id                     url
0   1  www.example.com/abcdef
1   2  www.example.com/abcdef
2   3  www.example.com/xyzabc

Upvotes: 3

Related Questions