Akyna

Reputation: 87

Find specific Tag Python BeautifulSoup

Hey, I'm trying to extract URLs between two tags.

This is what I've got so far:

html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = []
for links in soup.findAll('cite'):
    print(links.get('cite'))

I have tried different things, but I couldn't extract the URL between <cite>...</cite>.

My updated code:

import requests 
from bs4 import BeautifulSoup as bs

dorks = input("Keyword : ")

binglist = "http://www.bing.com/search?q="
    
with open(dorks , mode="r",encoding="utf-8") as my_file:
    for line in my_file:
        clean = binglist + line
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
        r = requests.get(clean, headers=headers)
        soup = bs(r.text, 'html.parser')
        links =  soup.find('cite')
        print(links)

In the keyword file you just need to put any keyword, like: test games (see the sketch below).
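For clarity, here's roughly how one line of the keyword file turns into a request URL. This is just a small sketch (the file contents and the strip() call are only an illustration, not part of my code above):

binglist = "http://www.bing.com/search?q="
# the keyword file is plain text, one query per line, e.g.:
# test games
line = "test games\n"            # lines read from a file keep their trailing newline
clean = binglist + line.strip()  # strip() so the newline isn't appended to the URL
print(clean)                     # http://www.bing.com/search?q=test games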

Thanks for your help

Upvotes: 0

Views: 3010

Answers (3)

Dmitriy Zub

Reputation: 1724

You're looking for this to get links from Bing organic results:

# container with needed data: title, link, snippet, etc.
for result in soup.select(".b_algo"):
    link = result.select_one("h2 a")["href"]
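If you also want the result title and the displayed URL next to each link, the selectors already used in this answer can be combined. A small sketch (it reuses the soup object built in the fuller example below, and assumes the .b_attribution cite element sits inside each .b_algo container, as it does in the HTML you posted):

for result in soup.select(".b_algo"):
    title = result.select_one("h2 a").text                # visible result title
    link = result.select_one("h2 a")["href"]              # target URL
    displayed = result.select_one(".b_attribution cite")  # displayed URL; may be None
    print(title, link, displayed.text if displayed else "", sep="\n")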

Specifically, for the example you provided:

from bs4 import BeautifulSoup

html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'

soup = BeautifulSoup(html_doc, "html.parser")
link = soup.select_one('.b_attribution cite').text
print(link)

# https://www.developpez.net/forums/d1497343/environnements-developpem...

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
}

params = {
  "q": "lasagna",
  "hl": "en",
}

html = requests.get("https://www.bing.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, "lxml")

for links in soup.select(".b_algo"):
    link = links.select_one("h2 a")["href"]
    print(link)

------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''

Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with extraction, maintenance, or bypassing blocks; instead, you only need to iterate over structured JSON and get what you want.

Code to integrate to achieve your goal:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "bing",
  "q": "lasagna"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)

------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''

Disclaimer: I work for SerpApi.

Upvotes: 0

kabooya

Reputation: 566

You can do it as follows:

from bs4 import BeautifulSoup

html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all('cite')  # find_all returns a list of matching tags
for link in links:
    print(link.text)

You can scrape Bing as follows:

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
r = requests.get("https://www.bing.com/search?q=test", headers=headers)
soup = bs(r.text, 'html.parser')
links = soup.find_all('cite')  # every <cite> tag on the results page

for link in links:
    print(link.text)

This code does the following:

  • With requests we get the web page we're looking for. We set headers to avoid being blocked by Bing (for more information, see: https://oxylabs.io/blog/5-key-http-headers-for-web-scraping).
  • Then we parse the HTML and extract all <cite> tags (find_all returns a list).
  • For each element in the list we only want what's inside the <cite> tag, so we print its contents with .text.

Please pay attention to the headers!
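If the script prints nothing, it's worth checking whether Bing actually returned a normal results page before blaming the parsing. A minimal sanity check, using only the requests API and the same browser-like headers as above:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}
r = requests.get("https://www.bing.com/search?q=test", headers=headers)

print(r.status_code)  # 200 means Bing returned a page at all
r.raise_for_status()  # raises requests.exceptions.HTTPError on a 4xx/5xx response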

Upvotes: 1

Shreyesh Desai

Reputation: 719

Try this:

from bs4 import BeautifulSoup

html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all('cite')
for link in links:
    print(link.text)

Upvotes: 0
