Rohan

Reputation: 93

Basic Webscraping with Python (Beautiful Soup & Requests)

So I've been working through Al Sweigart's online Automate The Boring Stuff With Python tutorials, and I've just got to the webscraping part. Here's my code with a description of what the program is supposed to do:

#! python3
# lucky.py - A small program that allows you to get search keywords from
# command line arguments, retrieve the search results page, and open
# a new browser tab for each result

# Steps:
# 1. Read the command line arguments from sys.argv
# 2. Fetch the search result page with the requests module
# 3. Find the links to each search result
# 4. Call the webbrowser.open() function to open the web browser

import sys, requests, bs4, webbrowser

# 1. Read the command line arguments from sys.argv

print('Googling...')

if len(sys.argv) > 1:
    search = ' '.join(sys.argv[1:])

url = "https://www.google.com/#q="

for i in range(len(search.split())):
    url += search.split()[i] + "+"

# 2. Fetch the search result page with the requests module

page = requests.get(url)

# 3. Find the links to each search result

soup = bs4.BeautifulSoup(page.text, 'lxml')
linkElems = soup.select('.r a')

# 4. Call the webbrowser.open() function to open the web browser

numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open("http://google.com" + linkElems[i].get('href'))

So the issue here is that when I check the length of linkElems, it's 0, meaning that soup.select('.r a') failed to find any <a> elements inside class="r" (a class Google uses only for search results, as you can see in the developer tools). As a result, none of the search result pages open in my browser.
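
For reference, here's the kind of check I mean (the status-code line is just an extra sanity check, not part of my original program):

print(len(linkElems))    # prints 0 -- nothing matched '.r a'
print(page.status_code)  # did the request itself succeed?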

I think the issue is either that the HTML parser isn't working correctly, or that Google has changed how its HTML is structured(?). Any insight into this would be greatly appreciated!

Upvotes: 0

Views: 709

Answers (3)

Dmitriy Zub

Reputation: 1724

There's no need for PhantomJS or Selenium here. Also, the query parameter is wrong: #q= should be ?q=.
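
Applied to your code, that fix might look like this (a sketch using urllib.parse to encode the query; the variable names follow your program):

import urllib.parse

search = ' '.join(sys.argv[1:])
url = 'https://www.google.com/search?' + urllib.parse.urlencode({'q': search})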

You can limit the number of selected links with list slicing:

linkElems = soup.select('.r a')[:5]
# or 
for i in soup.select('.r a')[:5]:
    # other code..

Make sure you're passing a user-agent, because the default requests user-agent is python-requests. Google can tell the request comes from a bot rather than a "real" user visit, blocks it, and returns different HTML with some sort of error. Passing a user-agent fakes a real user visit by adding that information to the HTTP request headers.

I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines, which covers multiple solutions.

Pass the user-agent in the request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)

Code:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']

    print(title, link, sep='\n')

--------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to iterate over structured JSON and grab the data you want, rather than building everything from scratch and maintaining it over time.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result['title'])
    print(result['link'])

---------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Disclaimer: I work for SerpApi.

Upvotes: 0

deltaray

Reputation: 432

Google seems to be detecting that you're a bot and not a real web browser with cookies and JavaScript. What they seem to be doing with the new results is getting web scrapers to follow the links they provide, prefixed with https://www.google.com, so that when you then go to that URL, Google can still track your movement.

You could also try to find a pattern in the links provided. For instance, a search for 'linux' returns the following:

/url?q=https://en.wikipedia.org/wiki/Linux&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=9775308e-206b-11e8-b45f-fb72cae612a8
/url?q=https://www.linux.org/&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=9775308e-206b-11e8-b45f-fb72cae612a8
/url?q=https://www.linux.com/what-is-linux&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=d50ea51a-206b-11e8-9432-2bee635f8337
/url?q=https://www.ubuntu.com/&sa=U&ved=9775308e-206b-11e8-b45f-fb72cae612a8&usg=dab9f6a4-206b-11e8-a999-3fc9d4576425
/search?q=linux&ie=UTF-8&prmd=ivns&source=univ&tbm=nws&tbo=u&sa=X&ved=9775308e-206b-11e8-b45f-fb72cae612a8

You could use a regex to grab the part between '/url?q=' and '&sa=U&ved=', as that's the URL you probably want. Of course, that doesn't work for the fifth result, because it's something special for the Google website itself. Again, tacking https://www.google.com onto the front of each returned URL is probably the safest thing to do.
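
For example, a sketch of that regex approach (the helper name is just for illustration; it assumes links of the /url?q=... form shown above):

import re
from urllib.parse import unquote

def extract_target(href):
    # grab everything between '/url?q=' and the next '&' parameter
    match = re.match(r'/url\?q=([^&]+)', href)
    if match:
        return unquote(match.group(1))  # decode any %-escapes in the URL
    return None

print(extract_target('/url?q=https://en.wikipedia.org/wiki/Linux&sa=U&ved=...'))
# https://en.wikipedia.org/wiki/Linux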

Most search engines (even duckduckgo.com) try to track search results and clicks, and if you try to avoid that, they have detection code in place to block you. You may have run into this already: Google tells you it has detected a large number of searches from your IP and makes you pass a CAPTCHA test to continue.

Upvotes: 1

PYA

Reputation: 8636

linkElems = soup.find_all('a', href=True)

This returns all the relevant <a> tags, and you can process the list to decide what to keep and what to discard.
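
For example, a minimal filtering sketch (the startswith('/url?q=') test is just one illustrative criterion, based on the link format shown in another answer):

linkElems = soup.find_all('a', href=True)
resultLinks = [a['href'] for a in linkElems if a['href'].startswith('/url?q=')]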

Upvotes: 0
