Reputation: 602
I'm trying to retrieve information from a site by web scraping. The information I need is in sub-tabs, but I'm not able to get it:
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
House
3
pièces,
74 m²
</span>
<cite>
New York (11111)
</cite>
</div>
</div>,
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
Appartement
3
pièces,
64 m²
</span>
<cite>
Los Angeles (22222)
</cite>
</div>
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
House
4
pièces,
81 m²
</span>
<cite>
Chicago (33333)
</cite>
</div>
I'm trying to get the ad and the city. I tried:
#BeautifulSoup
from bs4 import BeautifulSoup
import requests
#to get: House 3 pièces, 74 m²
ad = [ad.get_text() for ad in soup.find_all("span", class_='ergov3-txtannonce')]
#to get cities
cities = [city.get_text() for city in soup.find_all("cite", class_='ergov3-txtannonce')]
My output:
[]
[]
Expected output:
["House 3 pièces, 74 m²", "Appartement 3 pièces, 64 m²", "House 4 pièces, 81 m²"]
["New York (11111)", "Los Angeles (22222)", "Chicago (33333)"]
Upvotes: 0
Views: 48
Reputation: 25048
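Assuming your soup contains the provided HTML, select the elements that hold your information and iterate over the ResultSet to scrape it. Your attempt returns empty lists because the class ergov3-txtannonce is set on the outer div, not on the span or cite, so find_all("span", class_='ergov3-txtannonce') matches nothing. Avoid multiple lists; try to scrape all the information in one go and save it in a more structured way: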
...
data = []
for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title': e.span.get_text(strip=True),
        'city': e.cite.get_text(strip=True)
    })
...
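If the text inside the span is spread over several lines, as it appears in the question's markup, get_text(strip=True) only trims the ends and leaves the inner line breaks in place. A minimal sketch, assuming that layout, collapses every run of whitespace with split and join. The helper name clean_text is hypothetical, and 'html.parser' is just one possible parser choice:

```python
from bs4 import BeautifulSoup

# First ad from the question, with the span text spread over several lines.
html = '''
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
House
3
pièces,
74 m²
</span>
<cite>
New York (11111)
</cite>
</div>
</div>
'''


def clean_text(tag):
    """Hypothetical helper: collapse any run of whitespace into one space."""
    return " ".join(tag.get_text().split())


soup = BeautifulSoup(html, "html.parser")
data = [
    {"title": clean_text(e.span), "city": clean_text(e.cite)}
    for e in soup.select(".ergov3-txtannonce")
]
print(data)
```

When the text already sits on a single line, clean_text and get_text(strip=True) give the same result, so the helper should be safe to use in both cases.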
Note: If the elements are not present in your soup, the content of the website may be provided dynamically by JavaScript. That would be best asked as a new question with exactly this focus.
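One way to check this is to count the matching elements in the raw HTML the server returned, before any browser scripts run. The sketch below is only an illustration: count_ads is a hypothetical helper, and the two strings are made-up stand-ins for a static page and a script-driven page. With requests you would pass response.text instead.

```python
from bs4 import BeautifulSoup


def count_ads(html, selector=".ergov3-txtannonce"):
    """Hypothetical helper: number of elements in html matching selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


# Stand-in for a page that ships the ads in its HTML.
static_page = '<div class="ergov3-txtannonce"><span>House</span></div>'
# Stand-in for a page whose ads are injected later by a script.
dynamic_page = '<div id="app"></div><script src="app.js"></script>'

print(count_ads(static_page))   # 1
print(count_ads(dynamic_page))  # 0
```

A count of 0 on the real response suggests the ads are not in the downloaded HTML at all, in which case no selector will find them.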
Example:
from bs4 import BeautifulSoup
html='''
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
House 3 pièces, 74 m²
</span>
<cite>
New York (11111)
</cite>
</div>
</div>,
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
Appartement 3 pièces, 64 m²
</span>
<cite>
Los Angeles (22222)
</cite>
</div>
<div class="ergov3-txtannonce">
<div class="ergov3-h3"><span>
House 4 pièces, 81 m²
</span>
<cite>
Chicago (33333)
</cite>
</div>
'''
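# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
data = []
# Each matching div holds one ad: the span is the title, the cite is the city
for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title': e.span.get_text(strip=True),
        'city': e.cite.get_text(strip=True)
    })
data  # evaluating this in a REPL or notebook shows the result below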
[{'title': 'House 3 pièces, 74 m²', 'city': 'New York (11111)'},
{'title': 'Appartement 3 pièces, 64 m²', 'city': 'Los Angeles (22222)'},
{'title': 'House 4 pièces, 81 m²', 'city': 'Chicago (33333)'}]
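If the two separate lists from the question are still needed, they can be derived from this structure afterwards. A short sketch, assuming data has the shape shown above (the variable names ads and cities follow the question):

```python
# Result of the scraping step, copied from the output above.
data = [
    {'title': 'House 3 pièces, 74 m²', 'city': 'New York (11111)'},
    {'title': 'Appartement 3 pièces, 64 m²', 'city': 'Los Angeles (22222)'},
    {'title': 'House 4 pièces, 81 m²', 'city': 'Chicago (33333)'},
]

# Pull one field out of every record to rebuild the flat lists.
ads = [item['title'] for item in data]
cities = [item['city'] for item in data]

print(ads)
print(cities)
```

Keeping the records together and splitting them only at the end means a title can never drift out of step with its city.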
Upvotes: 1