Reputation: 71
My file has different URLs:
www.example.com
www.example.com/validagain
www.example.com/search?q=jsdajasj;kdas --> trying to get rid of this one
www.example.com/anothervalid
I was able to isolate the /search lines using regex:
import re

generate_links = re.compile('http://(.*)')        # compile all http links
generate_links2 = re.compile('(.*)/eng/(.*)')     # compile all English urls

with open(r"VAC\queue.txt", "r") as queued_list, open('newqueue.txt', 'w') as queued_list_updated:
    for links in queued_list:
        url = ""
        services_url = ""
        valid_url = ""
        match = generate_links2.search(links)
        if match is not None:
            url = match.group()
        generate_links3 = re.compile('(.*)/services/(.*)')  # compile all services links
        match2 = generate_links3.search(links)
        if match2 is not None:
            services_url = match2.group()
            print services_url
        generate_links4 = re.compile(r'(.*)/search\?(.*)')  # compile error links ('?' escaped to match literally)
        match3 = generate_links4.search(links)              # match all error links
But how do I use the match3 result against services_url so that the matched URLs are removed or replaced?
So the expected results would be:
www.example.com
www.example.com/validagain
www.example.com/anothervalid
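To make the intent concrete, here is a minimal, self-contained sketch of the filtering step I'm after (the sample list stands in for the file's lines; the variable names are illustrative). Note that in a regex `?` is a quantifier, so the literal character has to be escaped:

```python
import re

# Illustrative sample of the queue file's lines (stand-in for the real file).
lines = [
    "www.example.com\n",
    "www.example.com/validagain\n",
    "www.example.com/search?q=jsdajasj;kdas\n",
    "www.example.com/anothervalid\n",
]

# In a regex, '?' is a quantifier: '(.*)/search?(.*)' matches '/searc'
# followed by an OPTIONAL 'h'.  Escape it to match a literal '?'.
search_pattern = re.compile(r'/search\?')

# Keep only the lines the pattern does NOT match.
kept = [line for line in lines if search_pattern.search(line) is None]
```

After this, `kept` holds the three valid URLs and the search line is gone.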
Upvotes: 0
Views: 65
Reputation: 165
If you want to get rid of URLs containing 'search?', try:
from __future__ import print_function

with open(r"VAC\queue.txt") as infile, open('newqueue.txt', 'w') as outfile:
    cured_urls = [line for line in infile if 'search?' not in line]
    for url in cured_urls:
        print(url, file=outfile, end='')  # lines keep their '\n', so suppress print's extra newline
Upvotes: 1