Reputation: 137
I'm trying to extract the website address from several identically structured webpages. I've written a regex to parse it, but the pattern I've defined is undoubtedly a poor one. How can I get only the website address, which is located within the p tag under the post-content class?
I've tried with:
import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

def get_website(link):
    res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, "html5lib")
    text = soup.select_one('.post-content > p').get_text(strip=True, separator='\n')
    website = re.findall(r'\s+(.*)\n\[', text)[0]
    print(website)

if __name__ == '__main__':
    for link in links:
        get_website(link)
Result I'm getting:
www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra
Dirección: Las Pozas, 17 40194 Trescasas Segovia
Tel. 920 383 556 [email protected] centros1.pntic.mec.es/cp.el.barranco
Desired results:
www3.planalfa.es/stmorosario
centros1.pntic.mec.es/elvalle/webCra
centros1.pntic.mec.es/cp.el.barranco
Upvotes: 0
Views: 57
Reputation: 84465
I'm sure it won't take long to break the following:
import re
import requests
from bs4 import BeautifulSoup

links = [
    'https://colegios.es/2012/santisimo-rosario-mosen-rubi-avila/',
    'https://colegios.es/2012/cra-el-valle-villarejo-del-valle/',
    'https://colegios.es/2012/ceip-las-canadas-trescasas/',
    'https://colegios.es/2012/cra-el-barranco-san-esteban-del-valle/'
]

def get_website(link):
    # `s` is the requests.Session opened in the __main__ block below
    res = s.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, "html5lib")
    # Take the second-to-last <br/>-separated chunk of the paragraph
    y = str(soup.select_one('.post-content p')).split('<br/>')[-2]
    # Pages without a website end on the address line instead; skip those
    if 'Dirección' not in y:
        y = re.sub(r'\s{2,}', ' ', y).strip()
        website = y.split(' ')[-1]
        print(website)

if __name__ == '__main__':
    with requests.Session() as s:
        for link in links:
            get_website(link)
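A slightly less position-dependent variation, as a rough sketch only: instead of relying on which `<br/>` chunk the address lands in, scan the paragraph text token by token and keep the first one that looks like a bare host name with an optional path. The helper name `extract_website`, the pattern, and the sample strings below are my own assumptions for illustration (they are not taken from the live pages), and the pattern will certainly miss unusual addresses.

```python
import re

# Assumed shape of a website token: optional scheme, dotted host ending in
# letters, optional path. Tokens containing "@" (emails) cannot match.
SITE_RE = re.compile(r'(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?', re.IGNORECASE)

def extract_website(paragraph_text):
    """Return the first website-looking token in the text, or None."""
    for token in paragraph_text.split():
        token = token.strip(',;()')  # drop punctuation stuck to the token
        if SITE_RE.fullmatch(token):
            return token
    return None

if __name__ == '__main__':
    # Hypothetical sample lines modelled on the output shown in the question
    samples = [
        "Tel. 920 227 700 www3.planalfa.es/stmorosario",
        "Dirección: Las Pozas, 17 40194 Trescasas Segovia",
        "Tel. 920 383 556 info@example.org centros1.pntic.mec.es/cp.el.barranco",
    ]
    for sample in samples:
        print(extract_website(sample))
```

To try it against the real pages, the text could be taken from something like `soup.select_one('.post-content p').get_text(' ', strip=True)` and passed to `extract_website`; pages with no website would then give `None` rather than the address line.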
Upvotes: 0