Reputation: 87
Hey, I'm trying to extract URLs between two tags.
This is what I got so far:
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = []
for links in soup.findAll('cite'):
    print(links.get('cite'))
I have tried different things, but I couldn't extract the URL between
<cite>...</cite>
My updated code:
import requests
from bs4 import BeautifulSoup as bs

dorks = input("Keyword : ")
binglist = "http://www.bing.com/search?q="

with open(dorks, mode="r", encoding="utf-8") as my_file:
    for line in my_file:
        clean = binglist + line.strip()  # strip the trailing newline from each keyword
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
        r = requests.get(clean, headers=headers)
        soup = bs(r.text, 'html.parser')
        links = soup.find('cite')
        print(links)
In the keyword file you just need to put any keyword, like: test games
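For illustration, here is a minimal offline sketch of how each line of such a keyword file becomes a query URL (the keywords list below stands in for the file contents; quote_plus is used to encode spaces, which plain string concatenation does not do):

```python
from urllib.parse import quote_plus

# Stand-in for the lines of the keyword file (one keyword per line)
keywords = ["test games", "lasagna recipe"]

binglist = "http://www.bing.com/search?q="
for line in keywords:
    # strip the trailing newline and percent-encode spaces
    url = binglist + quote_plus(line.strip())
    print(url)
```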
Thanks for your help
Upvotes: 0
Views: 3010
Reputation: 1724
You're looking for this to get links from Bing organic results:
# container with needed data: title, link, snippet, etc.
for result in soup.select(".b_algo"):
    link = result.select_one("h2 a")["href"]
Specifically, for the example you provided:
from bs4 import BeautifulSoup
html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.select_one('.b_attribution cite').text
print(link)
# https://www.developpez.net/forums/d1497343/environnements-developpem...
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
}

params = {
    "q": "lasagna",
    "hl": "en",
}

html = requests.get("https://www.bing.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, "lxml")

for links in soup.select(".b_algo"):
    link = links.select_one("h2 a")["href"]
    print(link)
------------
'''
https://www.allrecipes.com/recipe/23600/worlds-best-lasagna/
https://www.foodnetwork.com/topics/lasagna
https://www.tasteofhome.com/recipes/best-lasagna/
https://www.simplyrecipes.com/recipes/lasagna/
'''
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with extraction, maintenance, or bypassing blocks; instead, you only need to iterate over structured JSON and get what you want.
Code to integrate to achieve your goal:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "bing",
    "q": "lion king"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 566
You can do it as follows:
from bs4 import BeautifulSoup

html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.find('cite')  # find returns the first matching tag
print(link.text)
You can webscrape Bing as follows:
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
r = requests.get("https://www.bing.com/search?q=test", headers=headers)
soup = bs(r.text, 'html.parser')
link = soup.find('cite')
print(link.text)
This code does the following:
- finds the first cite tag
- prints the content of that tag, using .text

Please pay attention to the headers!
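A quick way to check what a headers dict actually sends, without hitting the network, is to build the request and inspect the prepared version (the URL and header values here are just for illustration):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}

# Build the request without sending it, so we can inspect what would go out
req = requests.Request("GET", "https://www.bing.com/search", params={"q": "test"}, headers=headers)
prepared = req.prepare()

print(prepared.url)                    # query string is encoded for us
print(prepared.headers["User-Agent"])  # the header the server will see
```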
Upvotes: 1
Reputation: 719
Try this:
from bs4 import BeautifulSoup

html_doc = '<div class="b_attribution" u="1|5075|4778623818559697|b0YAhIRjW_h9ERBLSt80gnn9pWk7S76H"><cite>https://www.developpez.net/forums/d1497343/environnements-developpem...</cite><span class="c_tlbxTrg">'
soup = BeautifulSoup(html_doc, "html.parser")
links = soup.find_all('cite')
for link in links:
    print(link.text)
Upvotes: 0