robots.txt

Reputation: 137

Can't parse inconsistently formatted website addresses from identically structured pages

I'm trying to extract the website address from several webpages that share the same layout. I've written a regex to parse it, but the pattern I came up with is undoubtedly a poor one. How can I get only the website address from the p tag under the post-content class?

I've tried with:

import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

def get_website(link):
    res = requests.get(link,headers={'User-Agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text,"html5lib")
    text = soup.select_one('.post-content > p').get_text(strip=True, separator='\n')
    website = re.findall(r'\s+(.*)\n\[', text)[0]
    print(website)

if __name__ == '__main__':
    for link in links:
        get_website(link)

Result I'm getting:

www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra
Dirección: Las Pozas, 17 40194 Trescasas Segovia
Tel. 920 383 556 [email protected]   centros1.pntic.mec.es/cp.el.barranco

Desired results:

www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra

centros1.pntic.mec.es/cp.el.barranco

Upvotes: 0

Views: 57

Answers (1)

QHarr

Reputation: 84465

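I'm sure it won't take long to find a page that breaks the following. It splits the paragraph's HTML on <br/>, takes the second-to-last piece, and keeps the last whitespace-separated token unless that piece is the address line: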

import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

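def get_website(session, link):
    res = session.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, "html5lib")
    # Take the second-to-last <br/>-separated piece of the paragraph's HTML
    y = str(soup.select_one('.post-content p')).split('<br/>')[-2]
    # If that piece is the address line, the page lists no website: skip it
    if 'Dirección' not in y:
        # Collapse runs of whitespace, then keep the last token
        y = re.sub(r'\s{2,}', ' ', y).strip()
        website = y.split(' ')[-1]
        print(website)

if __name__ == '__main__':
    # Pass the session in explicitly instead of relying on a global
    with requests.Session() as s:
        for link in links:
            get_website(s, link)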
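
If the position of the website within the paragraph turns out to vary, one alternative worth trying is to scan the paragraph text for a token that looks like a bare domain, rather than relying on it being on the second-to-last line. The following is only a rough sketch: the helper name `extract_website` and the sample strings are made up for illustration (the strings are modelled on the output shown in the question), and the pattern is a heuristic, not a full URL validator. It runs offline on plain text, so you would feed it the result of `get_text()` from the selected p tag.

```python
import re

# Heuristic: optional scheme, one or more dot-separated labels,
# an alphabetic TLD, and an optional path. Not a full URL validator.
DOMAIN_RE = re.compile(
    r'(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?',
    re.IGNORECASE,
)

def extract_website(text):
    """Return the first whitespace-separated token that looks like a
    website address, or '' if none is found (hypothetical helper)."""
    for token in text.split():
        # fullmatch rejects e-mail addresses (they contain '@') and
        # abbreviations such as 'Tel.' (nothing follows the final dot)
        if DOMAIN_RE.fullmatch(token):
            return token
    return ''

# Sample strings modelled on the output in the question (assumed input)
samples = [
    'Tel. 920 383 556 [email protected]   centros1.pntic.mec.es/cp.el.barranco',
    'Dirección: Las Pozas, 17 40194 Trescasas Segovia',
]

for sample in samples:
    print(extract_website(sample))
# prints:
# centros1.pntic.mec.es/cp.el.barranco
# (an empty line for the page without a website)
```

Returning an empty string for pages without a website gives the blank line shown in the desired results, instead of skipping the page silently.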

Upvotes: 0
