Manikiran

Reputation: 1

Python Requests Google Custom Site Search Without API

I'm trying to create a web scraper that gets links from a Google search results page. Everything works fine, but I want to search a specific site only, i.e., instead of test, I want to search for site:example.com test. The following is my current code:

import requests,re
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

s_term=input("Enter search term: ").replace(" ","+")
r = requests.get('http://www.google.com/search', params={'q':'"'+s_term+'"','num':"50","tbs":"li:1"})

soup = BeautifulSoup(r.content,"html.parser")

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'])

print(links)

I tried using: ...params={'q':'"site%3Aexample.com+'+s_term+'"'... but it returns 0 results.
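A quick check suggests why that fails: requests builds the query string with urlencode-style percent-encoding, so a value that is already encoded gets encoded a second time:

```python
from urllib.parse import urlencode

# requests percent-encodes param values itself, so the already-encoded
# "site%3Aexample.com" is encoded again: "%3A" becomes "%253A",
# producing a query Google cannot match.
print(urlencode({'q': 'site%3Aexample.com test'}))
# q=site%253Aexample.com+test
```

Passing the raw site:example.com test string and letting requests handle the encoding avoids this.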

Upvotes: 0

Views: 762

Answers (2)

Dmitriy Zub

Reputation: 1724

You only need the "q" param. Also, make sure you're sending a user-agent header, because Google might eventually block your requests, in which case you'll receive completely different HTML. I've already answered what a user-agent is here.

Pass params:

params = {
  "q": "site:example.com test"
}

requests.get("YOUR_URL", params=params)

Pass user-agent:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "site:example.com test"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)

# http://example.com/
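Since hitting Google live requires the right headers, the selector logic above can also be exercised offline on a hand-built snippet that mimics the .tF2Cxc / .yuRUbf markup (the HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking Google's result markup (classes as above)
html = '''
<div class="tF2Cxc">
    <div class="yuRUbf"><a href="http://example.com/">Example Domain</a></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Collect hrefs into a list, matching the shape the question asked for
links = [r.select_one('.yuRUbf a')['href'] for r in soup.select('.tF2Cxc')]
print(links)
# ['http://example.com/']
```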

Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to figure out how to make things work, since that's already done for the end user; the only thing left is to iterate over the structured JSON and extract what you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:example.com test",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])

# http://example.com/

Disclaimer: I work for SerpApi.

Upvotes: 1

SIM

Reputation: 22440

Change your existing params to the one below:

params = {
    "source": "hp",
    "q": "site:example.com test",
    "oq": "site:example.com test",
    "gs_l": "psy-ab.12...10773.10773.0.22438.3.2.0.0.0.0.135.221.1j1.2.0....0...1.2.64.psy-ab..1.1.135.6..35i39k1.zWoG6dpBC3U"
}

Upvotes: 2
