pandas_as_pd

Reputation: 55

Why is a <span> empty in BeautifulSoup when exactly the same <span> on the website contains text?

I have to scrape 3 elements from this website:

http://www.altitude-maps.com/city/170_562,Poznan,Wielkopolskie,Poland

I need latitude, longitude and elevation, so my code is:

import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.altitude-maps.com/city/170_562,Poznan,Wielkopolskie,Poland'
r = requests.get(url)
soup = bs(r.content, features="html.parser")

latitude = soup.find('span', attrs={'id': 'curLat'}).get_text()
longitude = soup.find('span', attrs={'id': 'curLng'}).get_text()
elevation1 = soup.find('span', attrs={'id': 'altitude'}).get_text()  # from the text in the center
elevation2 = soup.find('span', attrs={'id': 'curElevation'}).get_text()  # from the box in the left

It finds values for the latitude and the longitude, but not for the elevation (in either case). Instead of getting '80.33 m (263.55 ft)' and '80.33 m', I get a single space and an empty string.

Comparison of the HTML from BeautifulSoup and from the website:

BS_elevation1 = soup.find('span', attrs={'id': 'altitude'}) 
#  BS_elevation1: <span id="altitude" style="font-size: 1.5em;"> </span>
#  This part on the website: <span id="altitude" style="font-size: 1.5em;">80.33 m (263.55 ft)</span>

BS_elevation2 = soup.find('span', attrs={'id': 'curElevation'})
#  BS_elevation2: <span id="curElevation" style=""></span>
#  This part on the website: <span id="curElevation" style>80.33 m</span>

It seems the text is present on the website but missing from what BeautifulSoup parses. I don't understand why this happens. How can I work around it?

Upvotes: 1

Views: 129

Answers (4)

import httpx
import trio
import re


async def main():
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.get('http://www.altitude-maps.com/city/170_562,Poznan,Wielkopolskie,Poland')
        # Capture each variable-name prefix and the single-quoted value after it on the same line
        goal = re.findall(r"(lati|long|elev).*?'(.+)'", r.text)
        print(goal)

if __name__ == "__main__":
    trio.run(main)

Output:

[('lati', '52.4063740'), ('long', '16.9251681'), ('elev', '80.329216003418')]

Upvotes: 2

QHarr

Reputation: 84465

A similar regex idea, but using a dictionary comprehension:

import re
import requests

items = ['latitude', 'longitude', 'elevation']
r = requests.get('http://www.altitude-maps.com/city/170_562,Poznan,Wielkopolskie,Poland').text

data = {item: re.search(r"(?<={}).*'(.*?)'".format(item), r).group(1) for item in items}
print(data)

Upvotes: 1

Andrej Kesely

Reputation: 195418

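The values you're searching for are embedded in the page inside a <script> tag and are only written into the <span> elements by JavaScript in the browser, so the spans BeautifulSoup parses are empty. You can, however, use the re module to pull the values out of the page source.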

To get latitude, longitude and elevation you can use this example:

import re
import requests

url = "http://www.altitude-maps.com/city/170_562,Poznan,Wielkopolskie,Poland"
text = requests.get(url).text

lat = re.search(r"geoplugin_latitude.*?([\d.-]+)", text).group(1)
lon = re.search(r"geoplugin_longitude.*?([\d.-]+)", text).group(1)
elv = re.search(r"geoip_elevation.*?([\d.-]+)", text).group(1)

print("Latitude:", lat)
print("Longitude:", lon)
print("Elevation:", elv)

Prints:

Latitude: 52.4063740
Longitude: 16.9251681
Elevation: 80.329216003418
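
A slightly more defensive variation, as a rough sketch only: `re.search(...).group(1)` raises `AttributeError` when the pattern is not found, so a small helper can return `None` instead and convert matches to `float`. The helper name `find_number` and the `SAMPLE_TEXT` string are made up for illustration; the sample assumes the script block assigns single-quoted values, which may not match the real page exactly.

```python
import re

# Hypothetical stand-in for the page source; the real script block may differ.
SAMPLE_TEXT = """
var geoplugin_latitude = '52.4063740';
var geoplugin_longitude = '16.9251681';
var geoip_elevation = '80.329216003418';
"""


def find_number(name, text):
    """Return the first number after `name` on the same line as a float, or None."""
    # re.escape guards against regex metacharacters in the variable name.
    match = re.search(re.escape(name) + r".*?(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


print(find_number("geoplugin_latitude", SAMPLE_TEXT))
print(find_number("geoip_elevation", SAMPLE_TEXT))
print(find_number("no_such_variable", SAMPLE_TEXT))  # missing name gives None
```

With the real page you would pass `requests.get(url).text` in place of `SAMPLE_TEXT`.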

Upvotes: 1

Tim Roberts

Reputation: 54678

Because the elevation is not in the HTML that the server sends. Do a "view source" in your browser and you'll see that the spans are empty there; they are filled in by JavaScript after the page loads.

Do note, however, that the data you want is all present in the JavaScript code in the second script block. That should be pretty easy to parse.
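
One way that parsing might look, as a minimal sketch using only the standard library: collect the text of every `<script>` block, then search each one for the assignments. The `SAMPLE_HTML` string, the `ScriptCollector` class and the `extract_values` function are hypothetical; the variable names are borrowed from the other answers, and the `name = 'value'` layout is an assumption about the page, not something verified here.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<span id="altitude" style="font-size: 1.5em;"> </span>
<script>var unrelated = 1;</script>
<script>
var geoplugin_latitude = '52.4063740';
var geoplugin_longitude = '16.9251681';
var geoip_elevation = '80.329216003418';
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_values(html):
    """Return a dict of the assumed variable names mapped to float values."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    pattern = re.compile(
        r"(geoplugin_latitude|geoplugin_longitude|geoip_elevation)"
        r"\s*=\s*'(-?\d+(?:\.\d+)?)'"
    )
    values = {}
    for code in collector.scripts:
        for name, number in pattern.findall(code):
            values[name] = float(number)
    return values


values = extract_values(SAMPLE_HTML)
print(values)
```

If BeautifulSoup is already in use, `soup.find_all('script')` could replace the collector class; the regex step would stay the same.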

Upvotes: 1
