Can't parse weird looking website addresses from some identical links

Question

I'm trying to extract the website address from several similarly structured webpages. I've written a regex to parse it, but the pattern I've defined is undoubtedly the worst one possible. How can I get only the website address, which is located within a p tag under the post-content class?

I've tried with:

import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

def get_website(link):
    res = requests.get(link,headers={'User-Agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text,"html5lib")
    text = soup.select_one('.post-content > p').get_text(strip=True, separator='\n')
    website = re.findall(r'\s+(.*)\n\[', text)[0]
    print(website)

if __name__ == '__main__':
    for link in links:
        get_website(link)

Result I'm getting:

www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra
Dirección: Las Pozas, 17 40194 Trescasas Segovia
Tel. 920 383 556 05005887@educa.jcyl.es   centros1.pntic.mec.es/cp.el.barranco

Desired results:

www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra

centros1.pntic.mec.es/cp.el.barranco
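
One possible direction, as a minimal sketch rather than a confirmed fix: instead of anchoring on the whitespace before the address and the following `[`, split the paragraph text into whitespace-separated tokens and keep the first one that looks like a host name and is not an e-mail address. The names `HOST_RE` and `extract_website` are hypothetical, and the sample strings are modeled on the output shown above, not on the actual page markup, so the pattern may need adjusting for the real pages.

```python
import re

# Hypothetical pattern: optional scheme, one or more dot-separated labels,
# a top-level domain of two or more letters, and an optional path.
HOST_RE = re.compile(
    r'(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?',
    re.IGNORECASE,
)

def extract_website(text):
    """Return the first token that looks like a website address, or ''."""
    for token in text.split():
        # Skip e-mail addresses such as 05005887@educa.jcyl.es.
        if '@' in token:
            continue
        # fullmatch rejects tokens like 'Tel.' or plain numbers.
        if HOST_RE.fullmatch(token):
            return token
    return ''

if __name__ == '__main__':
    # Sample strings assumed from the results in the question.
    samples = [
        'Tel. 920 383 556 05005887@educa.jcyl.es   centros1.pntic.mec.es/cp.el.barranco\n[',
        'Dirección: Las Pozas, 17 40194 Trescasas Segovia\n[',
    ]
    for sample in samples:
        print(repr(extract_website(sample)))
```

In `get_website`, the `re.findall(...)[0]` line could then be replaced with `website = extract_website(text)`, which yields an empty string when the paragraph contains no address instead of returning the address line.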
