ladybug

Reputation: 602

Get urls of a given site from queries

I'm trying to get URLs from a website based on keywords. I want to print only the first 10 results per query (to avoid a too-many-requests error).

import requests
from bs4 import BeautifulSoup

queries = ["ner", "spacy", "bert", "lda"]

for i in queries:
    reqs = requests.get("https://github.com/search?q=" + str(i))
    soup = BeautifulSoup(reqs.text, 'html.parser')

    for links in soup.select('a'):
        print(links.get('href'))

My output:

https://github.com/
/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Fsearch&source=header
/features/actions
/features/packages
/features/security
/features/codespaces
/features/copilot
/features/code-review
/features/issues
/features/discussions
/features
https://docs.github.com
https://skills.github.com/

I was looking for a list of links that contain one of these words...
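Filtering the collected hrefs by keyword and capping the output at ten could be sketched like this (filter logic only, run here over a small hypothetical sample of the hrefs that `soup.select('a')` yields):

```python
from itertools import islice

def filter_links(hrefs, keyword, limit=10):
    """Keep hrefs containing the keyword (case-insensitive), up to `limit` matches."""
    matches = (h for h in hrefs if h and keyword.lower() in h.lower())
    return list(islice(matches, limit))

# hypothetical sample of hrefs as extracted from the search page
sample = ["https://github.com/", "/features/actions", "/shiyybua/NER", "/wavewangyue/ner", None]
print(filter_links(sample, "ner"))  # → ['/shiyybua/NER', '/wavewangyue/ner']
```

Using a generator with `islice` stops the scan as soon as `limit` matches are found, rather than filtering the whole list first.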

Upvotes: 0

Views: 30

Answers (1)

HedgeHog

Reputation: 25196

Assuming you only want the links of the results, simply take the first anchor from each list item by selecting more specifically:

for e in soup.select('.codesearch-results li'):
    print(e.a.get('href'))

Example

import requests
from bs4 import BeautifulSoup

queries = ["ner", "spacy", "bert", "lda"]

for i in queries:
    reqs = requests.get(f"https://github.com/search?q={i}")
    soup = BeautifulSoup(reqs.text, 'html.parser')

    for e in soup.select('.codesearch-results li'):
        print(e.a.get('href'))

Output

/shiyybua/NER
/ryanoasis/nerd-fonts
/preservim/nerdtree
/bmild/nerf
/wavewangyue/ner
/synalp/NER
/preservim/nerdcommenter
/containerd/nerdctl
/NervJS/nerv
/deeppavlov/ner
/explosion/spaCy
/explosion/spacy-course
/explosion/spacy-models
/explosion/spacy-transformers
/chartbeat-labs/textacy
/susanli2016/NLP-with-Python
/explosion/spacy-streamlit
...
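Since the question also asks for only the first ten results per query, slicing the selected list is one simple option. A minimal, self-contained sketch (using a stand-in HTML snippet with the same `.codesearch-results li` structure, since the live page markup may change):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a search results page with 15 result items
html = "<div class='codesearch-results'><ul>" + "".join(
    f"<li><a href='/repo/{n}'>repo {n}</a></li>" for n in range(15)
) + "</ul></div>"
soup = BeautifulSoup(html, "html.parser")

# [:10] keeps only the first ten matched <li> elements
first_ten = [e.a.get("href") for e in soup.select(".codesearch-results li")[:10]]
print(first_ten)
```

Applied to the loop above, that would be `for e in soup.select('.codesearch-results li')[:10]:`.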

Upvotes: 1
