Reputation: 2089
I have the csv (or the dataframe) with the content as follows:
date | URLs | Count
-----------------------------------------------------------------------
17-mar-2014 | www.example.com/abcdef&=randstring | 20
10-mar-2016 | www.example.com/xyzabc | 12
14-apr-2015 | www.example.com/abcdef | 11
12-mar-2016 | www.example.com/abcdef/randstring | 30
15-mar-2016 | www.example.com/abcdef | 10
17-feb-2016 | www.example.com/xyzabc&=randstring | 15
17-mar-2016 | www.example.com/abcdef&=someotherrandstring | 12
I want to clean up the column 'URLs' where I want to convert www.example.com/abcdef&=randstring
or www.example.com/abcdef/randstring
to just www.example.com/abcdef, and so on, for all the rows.
I tried to play around with urlparse
library and parse the URLs to combile just the urlparse(url).netloc
along with urlparse(url).path
/query
/params
. But, it tuned out to be inefficient as every URL leads to completely different path/query/params.
Is there any work around for this using pandas? Any hints/ suggestions are much appreciated.
Upvotes: 3
Views: 2720
Reputation: 863791
I think you can use extract
by regex
- filter all string created by a-z
and A-Z
between www
and .com
and also another string starts with /
:
print (df.URLs.str.extract('(www.[a-zA-Z]*.com/[a-zA-Z]*)', expand=False))
0 www.example.com/abcdef
1 www.example.com/xyzabc
2 www.example.com/abcdef
3 www.example.com/abcdef
4 www.example.com/abcdef
5 www.example.com/xyzabc
6 www.example.com/abcdef
Name: URLs, dtype: object
Upvotes: 2
Reputation: 2630
I think it's related with regex more than pandas, try to use pandas.apply to change one column.
import pandas as pd
import re
def clear_url(origin_url):
p = re.compile('(www.example.com/[a-zA-Z]*)')
r = p.search(origin_url)
if r:
return r.groups(1)[0]
else:
return origin_url
d = [
{'id':1, 'url':'www.example.com/abcdef&=randstring'},
{'id':2, 'url':'www.example.com/abcdef'},
{'id':3, 'url':'www.example.com/xyzabc&=randstring'}
]
df = pd.DataFrame(d)
print 'origin_df'
print df
df['url'] = df['url'].apply(clear_url)
print 'new_df'
print df
Output:
origin_df
id url
0 1 www.example.com/abcdef&=randstring
1 2 www.example.com/abcdef
2 3 www.example.com/xyzabc&=randstring
new_df
id url
0 1 www.example.com/abcdef
1 2 www.example.com/abcdef
2 3 www.example.com/xyzabc
Upvotes: 3