Reputation: 6965

Using Beautiful Soup to count links on requested page

This should be fairly straightforward. I want to count the links returned by a search on a webpage. In this example, I search for "gwen stefani" on Stack Overflow. At the time of writing, the search returns 15 results.

import bs4  # Beautiful Soup 4
import requests
import webbrowser

url = "https://stackoverflow.com/search?q=gwen+stefani"

# Open the search page in the default browser for visual comparison
webbrowser.open(url)

# Fetch the same page once and parse it
r = requests.get(url)
html_content = r.text

soup = bs4.BeautifulSoup(html_content, "html.parser")

print(soup.title)

# Print the href of every anchor on the page
for link in soup.find_all("a"):
    print(link.get("href"))

When the links are printed out, they don't include any of the search results mentioned above. I'm new to Beautiful Soup, and I'm not sure where I'm going wrong.

Upvotes: 3

Answers (2)

hygull

Reputation: 8740

You can also try the code below, which does not rely on the class of the div element.

Just inspect the page and find the class of the question links.

import bs4 #  beautiful soup 4
import requests
import webbrowser
import json

url = "https://stackoverflow.com/search?q=gwen+stefani"

webbrowser.open(url)

r = requests.get(url)
html_content = r.text

# with open('response.html', 'w', encoding="utf-8") as f:
#   f.write(html_content)

soup = bs4.BeautifulSoup(html_content, "html.parser")

print(soup.title)
links = soup.find_all("a", class_='question-hyperlink')

valid_links = {}

for link in links:
    href = link.get('href', '').strip()  # fall back to '' if an anchor has no href

    if href.startswith('/questions/'):
        valid_links[href] = link.text.strip()

print(json.dumps(valid_links, indent=4)) # pretty printing dictionary
print(len(valid_links)) # 15

Output

<title>Posts containing 'gwen stefani' - Stack Overflow</title>
{
    "/questions/39268369/what-does-minus-minus-do-in-excel": "Q: What does \u2014 (minus minus) do in Excel? [duplicate]",
    "/questions/53264513/using-beautiful-soup-to-count-links-on-requested-page": "Q: Using Beautiful Soup to count links on requested page",
    "/questions/31074289/is-there-a-script-that-can-transfer-text-from-an-excel-file-into-an-adobe-design/31100563#31100563": "A: Is there a script that can transfer text from an excel file into an Adobe design program?",
    "/questions/39268369/what-does-minus-minus-do-in-excel/39268800#39268800": "A: What does \u2014 (minus minus) do in Excel?",
    "/questions/1668447/launch-failed-binary-not-found-snow-leopard-and-eclipse-c-c-ide-issue/8463357#8463357": "A: \u201cLaunch Failed. Binary Not Found.\u201d Snow Leopard and Eclipse C/C++ IDE issue",
    "/questions/33023818/split-and-rejoin-path-without-trailing-backslash": "Q: Split and rejoin path without trailing backslash",
    "/questions/36986461/regex-match-return-remaining-rest-of-string": "Q: Regex match, return remaining rest of string",
    "/questions/44686123/pass-variable-from-javascript-to-windows-batch-file": "Q: Pass variable from JavaScript to Windows batch file",
    "/questions/44686123/pass-variable-from-javascript-to-windows-batch-file/44686309#44686309": "A: Pass variable from JavaScript to Windows batch file",
    "/questions/52465425/reversing-a-list-with-single-element-gives-none": "Q: Reversing a list with single element gives None [duplicate]",
    "/questions/22196612/array-length-outside-of-a-method": "Q: Array length outside of a method",
    "/questions/13300815/not-getting-expected-results-from-select-query/13300920#13300920": "A: Not getting expected results from SELECT query",
    "/questions/32884087/slicing-string-from-start": "Q: Slicing string from start [duplicate]",
    "/questions/53264513/using-beautiful-soup-to-count-links-on-requested-page/53265048#53265048": "A: Using Beautiful Soup to count links on requested page",
    "/questions/23337218/recursive-conditions-missing-base-case": "Q: Recursive conditions - missing base case"
}
15
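
Note that the 15 hits mix questions ("Q:") and answers ("A:"), and in the output above the answer links carry a "#<answer id>" fragment. If you want to count questions and answers separately, or count the distinct questions behind the hits, a rough sketch could look like this. It uses only a handful of hrefs copied from the output, and question_id is a hypothetical helper that assumes the "/questions/<id>/<slug>" URL shape seen there.

```python
from urllib.parse import urldefrag

# A few hrefs copied from the output above (not the full list of 15).
hrefs = [
    "/questions/39268369/what-does-minus-minus-do-in-excel",
    "/questions/39268369/what-does-minus-minus-do-in-excel/39268800#39268800",
    "/questions/44686123/pass-variable-from-javascript-to-windows-batch-file",
    "/questions/44686123/pass-variable-from-javascript-to-windows-batch-file/44686309#44686309",
    "/questions/22196612/array-length-outside-of-a-method",
]


def question_id(href):
    """Hypothetical helper: pull the id out of '/questions/<id>/<slug>...'."""
    path, _fragment = urldefrag(href)
    return path.split("/")[2]


# Assumption: a link with a fragment points at an answer, one without at a question.
answer_links = [h for h in hrefs if urldefrag(h).fragment]
question_links = [h for h in hrefs if not urldefrag(h).fragment]
distinct_questions = {question_id(h) for h in hrefs}

print(len(hrefs))               # all hits in this sample
print(len(question_links))      # hits that are questions
print(len(answer_links))        # hits that are answers
print(len(distinct_questions))  # distinct questions behind those hits
```

The same three expressions can be applied to the keys of valid_links from the code above.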

Upvotes: 0

Kamikaze_goldfish

Reputation: 861

I'm using Python 3.x, so you might have to adjust for that, but I am getting all 15 links.

from bs4 import BeautifulSoup
import requests

url = 'https://stackoverflow.com/search?q=gwen+stefani'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
# Each search result sits in a div with class "result-link"
for link in soup.find_all('div', class_='result-link'):
    print('https://stackoverflow.com' + link.a['href'])
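
If you want the count itself rather than just the printed list, something like the following may work. Treat it as a sketch only: SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the results markup (it assumes each hit is an a tag nested in a div with class result-link, as in the code above), so it runs without a network request. The real page may be structured differently and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the search results page; not the real markup.
SAMPLE_HTML = """
<div class="result-link"><h3><a href="/questions/22196612/array-length-outside-of-a-method">Q: Array length outside of a method</a></h3></div>
<div class="result-link"><h3><a href="/questions/32884087/slicing-string-from-start">Q: Slicing string from start [duplicate]</a></h3></div>
<a href="/help">Help</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector: every <a> that has an href and sits inside <div class="result-link">.
# The unrelated "/help" link is not matched.
result_links = soup.select("div.result-link a[href]")

for a in result_links:
    print("https://stackoverflow.com" + a["href"])

# The number of result links is simply the length of the list.
print(len(result_links))
```

With a live page you would pass page.text instead of SAMPLE_HTML; the count then depends on whatever HTML the server actually returns to requests.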

Upvotes: 2
