Reputation: 121
I'm scraping Google search results. Here's the relevant part of my code.
from selenium.common import exceptions

def select_wholePage(driver):
    # Each direct child <div> of #main is treated as one result block
    items = driver.find_elements_by_xpath('//*[@id="main"]/div')
    assert isinstance(items, object)
    return items

def get_result(item_in):
    # Return (title, link), or None if the block has no title or link
    try:
        title = item_in.find_element_by_xpath('.//div/div/a/h3/div').text
        print(title)
    except exceptions.NoSuchElementException:
        return
    try:
        link = item_in.find_element_by_xpath('.//div/div/a').get_attribute('href')
        print(link)
    except exceptions.NoSuchElementException:
        return
    result = (title, link)
    return result
Output: I can get the desired elements, but when I print the link, it is prefixed with "https://www.google.com/url?q=".
How can I remove this prefix?
Upvotes: 0
Views: 207
Reputation: 4779
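You could strip off that prefix using str.removeprefix() (Python 3.9+). Note that lstrip() is not suitable here: it removes any leading characters found in its argument rather than a literal prefix, so it can also eat the start of the real URL.
s = "https://www.google.com/url?q=<some_query>"
s = s.removeprefix("https://www.google.com/url?q=")
print(s)
<some_query>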
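To illustrate why the choice of method matters, here is a minimal sketch comparing lstrip() with a literal prefix check. The helper name strip_prefix and the example.com URL are made up for illustration; the helper should also work on Python versions older than 3.9.

```python
GOOGLE_PREFIX = "https://www.google.com/url?q="

def strip_prefix(text, prefix=GOOGLE_PREFIX):
    """Hypothetical helper: drop `prefix` only if `text` really starts with it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# Illustrative wrapped link (example.com is an assumption, not from the question)
wrapped = GOOGLE_PREFIX + "https://www.example.com"

# lstrip() treats its argument as a set of characters, so it keeps stripping
# into the target URL until it hits a character outside that set ("x" here).
print(wrapped.lstrip(GOOGLE_PREFIX))   # xample.com

# A literal prefix check removes exactly the prefix and nothing more.
print(strip_prefix(wrapped))           # https://www.example.com
```

Links that do not start with the prefix are returned unchanged, which may be convenient if some results are not wrapped.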
Upvotes: 2
Reputation: 12721
I don't know if it's the cleanest way, but assuming the link always starts with the prefix, you could do something like this:
google_url_prefix = "https://www.google.com/url?q="
url_cut_id = len(google_url_prefix)
link = link[url_cut_id:]
Upvotes: 1
Reputation: 36590
If https://www.google.com/url?q= is fixed and always present, the .replace method should suffice, i.e.:
encased = "https://www.google.com/url?q=https://www.example.com"
core = encased.replace("https://www.google.com/url?q=", "", 1)
print(core)
output
https://www.example.com
I provided a third argument to .replace, which limits it to at most 1 replacement, in case https://www.google.com/url?q= appears again later in the string.
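If the wrapped links carry extra query parameters after the target URL, or the target is percent-encoded, removing the prefix alone may not be enough. A possible alternative is to parse the link and read its q parameter with the standard library. This is only a sketch: the function name unwrap_google_link, the example URL, and the trailing sa/ved parameters are assumptions for illustration, not taken from the question.

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_link(href):
    """Hypothetical helper: return the target of a Google /url?q=... link.

    Links that do not look like a Google redirect are returned unchanged.
    """
    parsed = urlparse(href)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        # parse_qs splits the query string and percent-decodes the values
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href

# Illustrative input; the extra parameters are assumed, not from the question
example = ("https://www.google.com/url?q="
           "https://www.example.com/page%3Fid%3D1&sa=U&ved=abc")
print(unwrap_google_link(example))  # https://www.example.com/page?id=1
```

In get_result this could be applied to the value returned by get_attribute('href') before building the result tuple.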
Upvotes: 1