Scraping site returns different href for a link

Question

In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and looked up the first result's title link using the Developer Tools. Then I used the following code to get the href with Python:

import requests
from bs4 import BeautifulSoup

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']

However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?

Vishnudev Krishnadas · Accepted Answer

There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see on the page to be the first one you get.

The 'gibberish' you mentioned is a percent-encoded URL. You'll need to parse and decode it to get the query parameters (params) of the URL.

For example: "/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"

The above href contains two params, namely kh and uddg. The uddg value is most likely the actual link you need.
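
To make the decoding step concrete, here is a minimal sketch that pulls both params out of the example href using only the standard library. The variable names are just for illustration, and it assumes the href has the same shape as the example above.

```python
from urllib.parse import urlparse, parse_qs

# Sample href copied from the example above.
href = "/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"

# urlparse isolates the query string; parse_qs splits it into a dict of
# lists and percent-decodes each value along the way.
params = parse_qs(urlparse(href).query)

kh = params["kh"][0]
target = params["uddg"][0]

print(kh)      # -1
print(target)  # https://www.example.com
```

Each value comes back as a list because a query string may repeat a key, hence the [0].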

The code below prints the decoded target URL of every link with that class.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')

# Visit every result link, not just the first match.
for anchor in soup.find_all('a', class_='result__a'):
    link = anchor.get('href', '')
    # parse_qs splits the query string and already percent-decodes the values,
    # so no separate unquote() call is needed.
    targets = parse_qs(urlparse(link).query).get('uddg')
    if targets:
        print(targets[0])
