Reputation: 137
I'm trying to get four fields from a webpage using Python, but the data I'm after isn't wrapped in any structured HTML, so I can't find a way to select each field individually.
I've tried with:
import re
import requests
from bs4 import BeautifulSoup

link = 'https://colegios.es/2012/cra-la-gaznata-san-bartolome-de-pinares/'

def get_content(link):
    res = requests.get(link,headers={'User-Agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text,"lxml")
    school_name = soup.select_one("h1 > a").get_text(strip=True)
    school_address = soup.find("p",text=re.compile('Dirección:\s*([^"]*?)')).text
    school_phone = soup.find("p",text=re.compile('Tel\.\s*(.*?)\s*')).text
    print(school_name,school_address,school_phone)

if __name__ == '__main__':
    get_content(link)
What I'm getting is really a mess:
CRA La Gaznata San Bartolomé de Pinares CRA La Gaznata Servicios: Jornada contínua, Educación Infantil y Primaria Público Dirección: del Pino, 2 5267 San Bartolomé de Pinares Ávila Tel. 920 270 070 Fax 920 270 070 [email protected] [google-map-v3 addmarkerlist=”del Pino, 2 5267 San Bartolomé de Pinares Ávila {}5-default.png”] CRA La Gaznata Servicios: Jornada contínua, Educación Infantil y Primaria Público Dirección: del Pino, 2 5267 San Bartolomé de Pinares Ávila Tel. 920 270 070 Fax 920 270 070 [email protected] [google-map-v3 addmarkerlist=”del Pino, 2 5267 San Bartolomé de Pinares Ávila {}5-default.png”]
Output I'd like to get (the second field is the suburb, which appears within the name heading):
CRA La Gaznata
San Bartolomé de Pinares
del Pino, 2 5267 San Bartolomé de Pinares Ávila
920 270 070
How can I get the four fields from that webpage?
Upvotes: 3
Views: 35
Reputation: 195438
The key is changing the parser to html5lib, that way the <br> tags will be correctly translated to newlines by the get_text() method, and then it's easier to parse the text with re:
import re
import requests
from bs4 import BeautifulSoup

link = 'https://colegios.es/2012/cra-la-gaznata-san-bartolome-de-pinares/'

def get_content(link):
    res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, "html5lib")

    # Join the text fragments of the first content paragraph with newlines
    text = soup.select_one('.post-content > p').get_text(strip=True, separator='\n')
    # The heading link is expected to split into exactly two parts: name and suburb
    school_name, suburb = soup.select_one("h1 > a").get_text(strip=True, separator='\n').split('\n')

    # '.' does not match a newline, so each capture stops at the end of its line
    school_address = re.findall(r'Dirección:\s*(.*)', text)[0]
    school_phone = re.findall(r'Tel\.\s*([\d\s]+\d)', text)[0]
    email = re.findall(r'[^\s]+@[^\s]+', text)[0]

    print(school_name)
    print(suburb)
    print(school_address)
    print(school_phone)
    print(email)

if __name__ == '__main__':
    get_content(link)
Prints:
CRA La Gaznata
San Bartolomé de Pinares
del Pino, 2 5267 San Bartolomé de Pinares Ávila
920 270 070
[email protected]
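If some pages lack one of the fields, the `re.findall(...)[0]` lookups above would raise an IndexError. A minimal offline sketch of a more forgiving variant follows. It is not part of the scraper above: `SAMPLE_TEXT`, `extract_field` and `parse_school_text` are hypothetical names, and the sample text only assumes where the page's line breaks fall, based on the output shown in the question.

```python
import re

# Assumed sample of what get_text(strip=True, separator='\n') might return
# for the paragraph; the line-break positions are a guess, not scraped data.
SAMPLE_TEXT = "\n".join([
    "CRA La Gaznata",
    "Servicios: Jornada contínua, Educación Infantil y Primaria",
    "Público",
    "Dirección: del Pino, 2 5267 San Bartolomé de Pinares Ávila",
    "Tel. 920 270 070",
    "Fax 920 270 070",
])


def extract_field(pattern, text):
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


def parse_school_text(text):
    """Pull the address and phone out of newline-separated paragraph text."""
    return {
        # '.' stops at the newline, so only the address line is captured
        "address": extract_field(r"Dirección:\s*(.*)", text),
        # digits and whitespace, ending on a digit, so the 'Fax' line is excluded
        "phone": extract_field(r"Tel\.\s*([\d\s]+\d)", text),
    }


if __name__ == "__main__":
    print(parse_school_text(SAMPLE_TEXT))
```

Using `re.search` with a `None` fallback means a page without a phone number yields a missing value instead of stopping the whole run, which may or may not be what you want for your data.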
Upvotes: 2