Yannis Dran

Reputation: 1529

Strip random characters from url

I have a list of urls as follows:

urls = [
    'www.example.com/search?q=Term&page=0',
    'www.example.com/search?q=Term&page=1',
    'www.example.com/search?q=Term&page=2'
]

Where Term might be whatever term we want: Europe, London, etc.

The relevant part of my code is the following:

for url in urls:
    file_name = url.replace('www.example.com/search?q=', '').replace('=', '').replace('&', '')
    file_name = file_name + '.html'

which results in:

Termpage0.html
Termpage1.html
and so on.

How can I strip Term from the URLs in the list so that the result is:

page0.html
page1.html
and so on?

Upvotes: 3

Views: 199

Answers (3)

niemmi

Reputation: 17263

You could use urllib.parse to parse the URL and then its query string. A benefit of this approach is that it will keep working if the order of the query parameters changes or new parameters are added:

from urllib import parse

urls = [
    'www.example.com/search?q=Term&page=0',
    'www.example.com/search?q=Term&page=1',
    'www.example.com/search?q=Term&page=2'
]

for url in urls:
    parts = parse.urlparse(url)
    query = parse.parse_qs(parts.query)
    print('page{}.html'.format(query['page'][0]))

Output:

page0.html
page1.html
page2.html

In the above, urlparse returns a ParseResult object that contains the URL components:

>>> from urllib import parse
>>> parts = parse.urlparse('www.example.com/search?q=Term&page=0')
>>> parts
ParseResult(scheme='', netloc='', path='www.example.com/search', params='', query='q=Term&page=0', fragment='')

Then parse_qs returns a dict of query parameters whose values are lists:

>>> query = parse.parse_qs(parts.query)
>>> query
{'page': ['0'], 'q': ['Term']}

Upvotes: 5

smac89

Reputation: 43068

Continue with what you were doing and use str.replace:

for url in urls:
    file_name = url.replace('www.example.com/search?q=', '').replace('=', '').replace('&', '').replace('Term', '')
    file_name = file_name + '.html'

Example:

>>> urls = ['www.example.com/search?q=Term&page=0', 'www.example.com/search?q=Term&page=1', 'www.example.com/search?q=Term&page=2']
>>> for url in urls:
...   file_name = url.replace('www.example.com/search?q=','').replace('=','').replace('&','').replace('Term', '')
...   file_name = file_name+('.html')
...   print (file_name)
page0.html
page1.html
page2.html

If the terms are random, then use re.sub from the re module like this:

file_name = re.sub('Term|Term1|Term2', '', file_name)

Or this if the term isn't known before the program is run:

import re, sys

pattern = re.compile("|".join(map(str.strip, sys.stdin.readlines())))
file_name = pattern.sub('', file_name)

Example:

>>> import re
>>> for url in urls:
...   file_name = url.replace('www.example.com/search?q=','').replace('=','').replace('&','')
...   file_name = re.sub('Term|Term1|Term2', '', file_name)
...   file_name = file_name+('.html')
...   print(file_name)
... 
page0.html
page1.html
page2.html
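
As a rough end-to-end sketch of the stdin-driven variant above (assuming the terms to strip are piped in one per line, e.g. echo Term | python script.py, where script.py is just a placeholder name):

import re
import sys

urls = ['www.example.com/search?q=Term&page=0',
        'www.example.com/search?q=Term&page=1',
        'www.example.com/search?q=Term&page=2']

# Build an alternation pattern from the terms read on stdin, one per line
pattern = re.compile("|".join(map(str.strip, sys.stdin.readlines())))

for url in urls:
    file_name = url.replace('www.example.com/search?q=', '').replace('=', '').replace('&', '')
    file_name = pattern.sub('', file_name) + '.html'
    print(file_name)  # page0.html, page1.html, page2.html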

Upvotes: 0

John Zwinck

Reputation: 249103

If you just want the last part after the last &, it's easy:

url.split('&')[-1].replace('=', '') + '.html'
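
For example, a quick sketch applying it to the URLs from the question:

>>> urls = ['www.example.com/search?q=Term&page=0',
...         'www.example.com/search?q=Term&page=1',
...         'www.example.com/search?q=Term&page=2']
>>> [url.split('&')[-1].replace('=', '') + '.html' for url in urls]
['page0.html', 'page1.html', 'page2.html']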

Upvotes: 2
