ladybug

Reputation: 602

Get information in sub-tags

I'm trying to retrieve information from a site by web scraping. The information I need is in nested sub-tags, but I'm not able to get it:

<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 3
 pièces,                                                                                                         
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>,
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement
 3
 pièces,                                                                                                         
64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 4
 pièces,                                                                                                         
81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>

I'm trying to get the ad and the city. I tried:

#BeautifulSoup
from bs4 import BeautifulSoup
import requests

# to get: House 3 pièces, 74 m²
ad = [ad.get_text() for ad in soup.find_all("span", class_='ergov3-txtannonce')]

# to get cities
cities = [city.get_text() for city in soup.find_all("cite", class_='ergov3-txtannonce')]

My output:

[]
[]

Expected output:

["House 3 pièces, 74 m²", "Appartement 3 pièces, 64 m²", "House 4 pièces, 81 m²"]
["New York (11111)", "Los Angeles (22222)", "Chicago (33333)"]

Upvotes: 0

Views: 48

Answers (1)

HedgeHog

Reputation: 25048

Assuming your soup contains the provided HTML, select the elements that hold your information and iterate over the ResultSet to scrape it. Avoid multiple lists; try to scrape all the information in one go and store it in a more structured way:

...
data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })
...

Note: If the elements are not present in your soup, the website's content may be rendered dynamically by JavaScript. That would be a good candidate for a new question with exactly that focus.
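
One quick way to check this is to count how many elements in the downloaded HTML match the selector. The following is only a minimal sketch: the helper name count_listings and both sample strings are made up for illustration, and in practice you would pass in the text of your own response instead.

```python
from bs4 import BeautifulSoup


def count_listings(html, selector=".ergov3-txtannonce"):
    """Return how many elements in the given HTML match the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


# Hypothetical samples: one page with the listing markup already in the HTML,
# and one that only ships an empty container to be filled in by JavaScript.
static_html = '<div class="ergov3-txtannonce"><span>House</span></div>'
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(count_listings(static_html))
print(count_listings(js_shell))
```

A count of zero on the real response suggests the listings are not in the HTML that requests received.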

Example
from bs4 import BeautifulSoup

html='''
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House 3 pièces, 74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>,
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement 3 pièces, 64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House 4 pièces, 81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>
'''
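# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')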

data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })

data
Output
[{'title': 'House 3 pièces, 74 m²', 'city': 'New York (11111)'},
 {'title': 'Appartement 3 pièces, 64 m²', 'city': 'Los Angeles (22222)'},
 {'title': 'House 4 pièces, 81 m²', 'city': 'Chicago (33333)'}]
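
If you do want the two separate lists from the question, a variation along the following lines might work. The original attempt returns empty lists because the class ergov3-txtannonce sits on the outer div, not on the span or cite, so a descendant selector is used here instead. This is a sketch under assumptions: the HTML is a shortened, well-formed copy of the question's markup, and the clean helper is a made-up name that collapses the line breaks inside each span into single spaces.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup, with the text spread over several lines
html = '''
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 3
 pièces,
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement
 3
 pièces,
64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")


def clean(tag):
    """Collapse every run of whitespace in the tag's text into a single space."""
    return " ".join(tag.get_text().split())


# Descendant selectors: any span / cite nested inside the classed div
ads = [clean(tag) for tag in soup.select(".ergov3-txtannonce span")]
cities = [clean(tag) for tag in soup.select(".ergov3-txtannonce cite")]

print(ads)
print(cities)
```

Keeping one list of dicts, as above, is still the safer choice when an ad might be missing its city, since two independent lists can silently fall out of step.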

Upvotes: 1
