Reputation: 195
I'm using the XPath //*[@href] to scrape links from a web page. However, I noticed that besides the actual links on the page, Selenium is also scraping javascript:void(0).
Why does this happen?
If you run this test on http://google.com you will find 45 links - 3 of them are not actually links but javascript:void(0).
How can I fix this?
Update: The expected output on http://google.com (at the time of writing) is the following:
1 = https://www.google.com/images/branding/product/ico/googleg_lodp.ico
2 = https://www.google.com/
3 = https://www.google.com/setprefs?suggon=2&prev=https://www.google.com/?gws_rd%3Dssl&sig=0_OQUUDCX_hZxBr1qNxxxxxxxxxxxEH_4%3D
4 = https://mail.google.com/mail/?tab=wm
5 = https://www.google.com/imghp?hl=en&tab=wi&ei=3NfSWL2xxxxxxxBg&ved=0EKouCBgoAQ
6 = https://www.google.com/intl/en/options/
7 = https://myaccount.google.com/?utm_source=OGB
8 = https://www.google.com/webhp?tab=ww&ei=3NfSWL2DKpxxxxxxxg&ved=0EKkuCAIoAQ
9 = https://maps.google.com/maps?hl=en&tab=wl
10 = https://www.youtube.com/
11 = https://play.google.com/?hl=en&tab=w8
12 = https://news.google.com/nwshp?hl=en&tab=wn&ei=3NfSWL2xxxxxxxxxxxBg&ved=0EKkuCAYoBQ
13 = https://mail.google.com/mail/?tab=wm
14 = https://drive.google.com/?tab=wo
15 = https://www.google.com/calendar?tab=wc
16 = https://plus.google.com/?gpsrc=ogpy0&tab=wX
17 = https://translate.google.com/?hl=en&tab=wT
18 = https://photos.google.com/?tab=wq&pageId=none
19 = https://www.google.com/intl/en/options/
20 = http://www.google.com/shopping?hl=en&tab=wf&ei=3NxxxxxxxxxTYBg&ved=0EKkuCA0oDA
21 = https://wallet.google.com/?tab=wa
22 = https://www.google.com/finance?tab=we
23 = https://docs.google.com/document/?usp=docs_alc
24 = https://books.google.com/bkshp?hl=en&tab=wp&ei=3NfSWL2xxxxxxxxxxxBg&ved=0EKkuCBEoEA
25 = https://www.blogger.com/?tab=wj
26 = https://www.google.com/contacts/?hl=en&tab=wC
27 = https://hangouts.google.com/
28 = https://keep.google.com/
29 = https://www.google.com/intl/en/options/
30 = https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/%3Fgws_rd%3Dssl
----------------------- (removed)
32 = https://www.google.com/url?q=https://www.google.com/intl/en_us/homepage/search/sp-firefox.html%3Futm_source%3Dgoogle.com%26utm_medium%3Dpushdown%26utm_content%3Dswitch%26utm_campaign%3Dffdse&source=hpp&id=190xx319&ct=7&usg=AFxxxxxxxxbZR_QouKfSxxxxxxxuQ&cot=2
33 = https://www.google.com/webhp?hl=en&sa=X&ved=0ahUKxxxxxy8-rSAxxxxxxxx8QPAgD
34 = https://support.google.com/websearch/answer/186645?hl=en
35 = https://www.google.com/intl/en/policies/privacy/?fg=1
36 = https://www.google.com/intl/en/policies/terms/?fg=1
37 = https://www.google.com/preferences?hl=en
38 = https://www.google.com/preferences?hl=en&fg=1
39 = https://www.google.com/advanced_search?hl=en&fg=1
40 = https://www.google.com/history/optout?hl=en&fg=1
41 = https://support.google.com/websearch/?p=ws_results_help&hl=en&fg=1
------------------ (removed)
43 = https://www.google.com/intl/en/ads/?fg=1
44 = https://www.google.com/services/?fg=1
45 = https://www.google.com/intl/en/about.html?fg=1
Upvotes: 0
Views: 852
Reputation: 3718
If you really need all elements with an "href" attribute except these 2 links, you can use the following XPath:
//*[@href][not(contains(@href,'javascript:void'))]
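To see what that predicate does without starting a browser, here is a minimal sketch that mimics it with Python's standard html.parser module. It is not Selenium code: HrefCollector, collect_hrefs and the SAMPLE markup (with example.com URLs) are made up for illustration. It collects every href attribute, like //*[@href], and then drops values containing javascript:void, like the not(contains(...)) part.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every element that has one (like //*[@href])."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(html, skip="javascript:void"):
    """Return href values found in html, dropping any that contain `skip`."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return [h for h in parser.hrefs if skip not in h]


# Hypothetical markup: a <link>, a normal anchor, and a script-only anchor.
SAMPLE = """
<link rel="icon" href="https://example.com/favicon.ico">
<a href="https://example.com/mail">Mail</a>
<a href="javascript:void(0)">Menu</a>
"""

print(collect_hrefs(SAMPLE))
```

The javascript:void(0) entry is a real href value in the markup (typically on an anchor that only triggers a script), which is why //*[@href] matches it in the first place.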
Upvotes: 2
Reputation: 52685
If you want to get only anchors whose href contains a URL, you might use the XPath expression below:
"//a[starts-with(@href, 'http')]"
This should cover URLs with both the http and https schemes.
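One caveat worth checking: XPath tests the attribute as it is written in the markup, so a relative href such as /preferences would not match starts-with(@href, 'http'). Restricting to a elements also skips other href-bearing elements such as link. If that matters, an alternative is to keep the broad XPath and filter the collected values in code. The sketch below assumes you already have a list of href strings (for example from get_attribute("href") on each element, which as far as I know returns the resolved absolute URL); is_http_link and the sample list are hypothetical.

```python
from urllib.parse import urlsplit


def is_http_link(href):
    """True when href is an absolute http or https URL."""
    if not href:
        return False
    # urlsplit lowercases the scheme, so "HTTPS://..." is accepted too
    return urlsplit(href.strip()).scheme in ("http", "https")


# Hypothetical values standing in for what the scraper collected.
hrefs = [
    "https://example.com/",
    "http://example.com/shopping",
    "javascript:void(0)",
    "mailto:someone@example.com",
    None,
]

links = [h for h in hrefs if is_http_link(h)]
print(links)
```

Checking the parsed scheme rather than a string prefix also rejects mailto: and similar non-web schemes, not just javascript:.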
Upvotes: 1