Reputation: 1529
I have a list of urls as follows:
urls = [
    'http://www.example.com/search?q=Term&page=0',
    'http://www.example.com/search?q=Term&page=1',
    'http://www.example.com/search?q=Term&page=2'
]
Where Term might be whatever term we want: Europe, London, etc.
The relevant part of my code is the following:
for url in urls:
    file_name = url.replace('http://www.example.com/search?q=','').replace('=','').replace('&','')
    file_name = file_name + '.html'
which results in:
Termpage0.html
Termpage1.html
and so on..
How can I strip the Term from each URL in the list so that the result is:
page0.html
page1.html
and so on?
Upvotes: 3
Views: 199
Reputation: 17263
You could use urllib.parse to parse the URL and then the query part. The benefit of this approach is that it works the same way if the order of the query parameters changes or new parameters are added:
from urllib import parse
urls = [
    'www.example.com/search?q=Term&page=0',
    'www.example.com/search?q=Term&page=1',
    'www.example.com/search?q=Term&page=2'
]
for url in urls:
    parts = parse.urlparse(url)
    query = parse.parse_qs(parts.query)
    print('page{}.html'.format(query['page'][0]))
Output:
page0.html
page1.html
page2.html
In the code above, urlparse returns a ParseResult object that contains the URL components:
>>> from urllib import parse
>>> parts = parse.urlparse('www.example.com/search?q=Term&page=0')
>>> parts
ParseResult(scheme='', netloc='', path='www.example.com/search', params='', query='q=Term&page=0', fragment='')
Then parse_qs returns a dict of query parameters in which each value is a list:
>>> query = parse.parse_qs(parts.query)
>>> query
{'q': ['Term'], 'page': ['0']}
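To make the order-independence point concrete, here is a minimal sketch that wraps the same two calls in a helper. The function name page_file_name and the 'unknown' fallback are my own choices for illustration, not part of the question; the fallback only covers the case where a URL has no page parameter at all.

```python
from urllib import parse


def page_file_name(url, default="unknown"):
    """Return 'page<N>.html' built from the 'page' query parameter of url."""
    # urlsplit separates out the query string whether or not a scheme is present.
    query = parse.parse_qs(parse.urlsplit(url).query)
    # parse_qs maps each name to a list of values; take the first one,
    # or fall back to `default` (an assumption) when 'page' is missing.
    page = query.get("page", [default])[0]
    return "page{}.html".format(page)


urls = [
    "http://www.example.com/search?q=Europe&page=0",
    "http://www.example.com/search?page=1&q=London",  # parameters reordered
    "http://www.example.com/search?q=Term",           # no page parameter
]
print([page_file_name(url) for url in urls])
```

The search term never has to be known or stripped, since only the page value is read.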
Upvotes: 5
Reputation: 43068
Continue with what you were doing and use str.replace:
for url in urls:
    file_name = url.replace('http://www.example.com/search?q=','').replace('=','').replace('&','').replace('Term', '')
    file_name = file_name + '.html'
Example:
>>> urls = ['www.example.com/search?q=Term&page=0', 'www.example.com/search?q=Term&page=1', 'www.example.com/search?q=Term&page=2']
>>> for url in urls:
...     file_name = url.replace('www.example.com/search?q=','').replace('=','').replace('&','').replace('Term', '')
...     file_name = file_name + '.html'
...     print(file_name)
...
page0.html
page1.html
page2.html
If the term can be any one of several known values, use re.sub like this:
re.sub('Term|Term1|Term2', '', file_name)
Or like this if the terms aren't known before the program runs (here they are read from standard input, one per line):
pattern = re.compile("|".join(map(str.strip, sys.stdin.readlines())))
pattern.sub('', file_name)
Example:
>>> for url in urls:
...     file_name = url.replace('www.example.com/search?q=','').replace('=','').replace('&','')
...     file_name = re.sub('Term|Term1|Term2', '', file_name)
...     file_name = file_name + '.html'
...     print(file_name)
...
page0.html
page1.html
page2.html
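One caveat with an alternation such as 'Term|Term1|Term2': the regex engine tries alternatives left to right, so for a URL containing Term1 it matches the shorter Term first and leaves a stray 1 behind. A possible way around that, sketched below, is to sort the terms longest first and escape them so they are treated as literal text. The helper name strip_terms is made up for this example.

```python
import re


def strip_terms(text, terms):
    """Remove every occurrence of the given literal terms from text."""
    # Longest first, so 'Term1' is tried before its prefix 'Term'.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' in a term literal.
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.sub("", text)


terms = ["Term", "Term1", "Term2"]
url = "www.example.com/search?q=Term1&page=0"
file_name = url.replace("www.example.com/search?q=", "").replace("=", "").replace("&", "")
print(strip_terms(file_name, terms) + ".html")  # page0.html
```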
Upvotes: 0
Reputation: 249103
If you just want the last part after the last &, it's easy:
url.split('&')[-1].replace('=', '') + '.html'
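As a quick check of that one-liner, here it is wrapped in a small function (the name last_param_file_name is only for illustration). Note that it relies on page being the final parameter in the URL, which holds for the URLs in the question but is an assumption in general.

```python
def last_param_file_name(url):
    # Take whatever follows the last '&', drop the '=', and add the extension.
    return url.split('&')[-1].replace('=', '') + '.html'


print(last_param_file_name('www.example.com/search?q=Term&page=2'))  # page2.html
# If the parameters come in a different order, the last one is no longer 'page':
print(last_param_file_name('www.example.com/search?page=2&q=Term'))  # qTerm.html
```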
Upvotes: 2