Reputation: 602
I'm trying to get URLs from a website based on keywords, and I want to print only the first 10 results for each keyword (to avoid making too many requests).
import requests
from bs4 import BeautifulSoup

queries = ["ner", "spacy", "bert", "lda"]
for i in queries:
    reqs = requests.get("https://github.com/search?q=" + str(i))
    soup = BeautifulSoup(reqs.text, 'html.parser')
    for links in soup.select('a'):
        print(links.get('href'))
My output:
https://github.com/
/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Fsearch&source=header
/features/actions
/features/packages
/features/security
/features/codespaces
/features/copilot
/features/code-review
/features/issues
/features/discussions
/features
https://docs.github.com
https://skills.github.com/
I was expecting a list of result links that contain one of these keywords...
Upvotes: 0
Views: 30
Reputation: 25196
Assuming you only want the links of the search results, simply take the first link from each list item by selecting more specifically:
for e in soup.select('.codesearch-results li'):
    print(e.a.get('href'))
import requests
from bs4 import BeautifulSoup

queries = ["ner", "spacy", "bert", "lda"]
for i in queries:
    reqs = requests.get(f"https://github.com/search?q={i}")
    soup = BeautifulSoup(reqs.text, 'html.parser')
    for e in soup.select('.codesearch-results li'):
        print(e.a.get('href'))
/shiyybua/NER
/ryanoasis/nerd-fonts
/preservim/nerdtree
/bmild/nerf
/wavewangyue/ner
/synalp/NER
/preservim/nerdcommenter
/containerd/nerdctl
/NervJS/nerv
/deeppavlov/ner
/explosion/spaCy
/explosion/spacy-course
/explosion/spacy-models
/explosion/spacy-transformers
/chartbeat-labs/textacy
/susanli2016/NLP-with-Python
/explosion/spacy-streamlit
...
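To also honor the "first 10 results" requirement from the question, you can slice the selection with `[:10]`. Here is a minimal, self-contained sketch of that slicing, run against a small hypothetical HTML snippet that mimics the `.codesearch-results li` structure (the real page markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking GitHub's search-result list
html = "<div class='codesearch-results'><ul>" + "".join(
    f"<li><a href='/repo{n}'>repo{n}</a></li>" for n in range(15)
) + "</ul></div>"

soup = BeautifulSoup(html, "html.parser")

# Slice the matched list items to at most 10 per query
links = [e.a.get("href") for e in soup.select(".codesearch-results li")[:10]]
print(links)
```

In the real script you would apply the same `[:10]` slice inside the query loop, e.g. `for e in soup.select('.codesearch-results li')[:10]:`.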
Upvotes: 1