shamilpython

Reputation: 485

Scraping site returns different href for a link

In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and, using the Developer Tools, saw that the first result's title was <a class="result__a" href="http://example.com">. I then used the following code to get the href with Python:

import requests
from bs4 import BeautifulSoup
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']

However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?

Upvotes: 2

Views: 283

Answers (1)

Vishnudev Krishnadas

Reputation: 10960

There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see in the browser to be the first one you get.

The 'gibberish' you mentioned is a percent-encoded URL. You'll need to parse it and decode its query parameters (params) to get the real link.

For example: "/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"

The above href contains two params, namely kh and uddg. I suppose uddg is the actual link you need.
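
To check the decoding step on its own, without any network request, the example href above can be parsed directly. This is only a minimal sketch: the helper name extract_target is hypothetical, and it assumes the destination is always carried in the uddg parameter, falling back to the href unchanged when that parameter is absent.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Hypothetical helper: return the URL stored in 'uddg', or href unchanged."""
    query = urlparse(href).query
    # parse_qs already percent-decodes the values it returns
    targets = parse_qs(query).get('uddg')
    return targets[0] if targets else href

example = "/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
print(extract_target(example))  # prints https://www.example.com
```

Because parse_qs returns each value as a list, the first element is taken; a plain href with no uddg parameter passes through untouched.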

The code below prints the decoded target URL of every link with that class.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')

# Walk every result link, not just the first match
for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    # parse_qs percent-decodes values itself, so calling unquote
    # on top of it would decode the URL a second time
    targets = parse_qs(url_obj.query).get('uddg', [])
    if targets:
        print(targets[0])

Upvotes: 1
