Reputation: 119
Hello, I have the HTML below. When I parse it with Beautiful Soup, I am not able to select the span by its class. I think the problem is in the outer div, as the nested tags are not recognized as its children. How can I select the text of the span tag?
Thanks
<div data-component="new_enquiry_form_app" data-props="{"isTelRequired":false,"placement":"top",}">
<section class="enquiry-form-box__wrapper">
<div class="enquiry-form-box enquiry-form-box--inverted">
<form class="enquiry-form-box__form" tabindex="-1">
<fieldset class="enquiry-form-box__wrapper">
<div class="enquiry-form-box__fields">
<div class="k-ns">
<span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
</div>
</div>
</fieldset>
</form>
</div>
</section>
Upvotes: 0
Views: 294
Reputation: 910
Try this:
from bs4 import BeautifulSoup
html = '''<div data-component="new_enquiry_form_app" data-props="{"isTelRequired":false,"placement":"top",}">
<section class="enquiry-form-box__wrapper">
<div class="enquiry-form-box enquiry-form-box--inverted">
<form class="enquiry-form-box__form" tabindex="-1">
<fieldset class="enquiry-form-box__wrapper">
<div class="enquiry-form-box__fields">
<div class="k-ns">
<span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>
</div>
</div>
</fieldset>
</form>
</div>
</section>'''
soup = BeautifulSoup(html, 'html.parser')
span = soup.select_one('span.text-gray.block.mt-3.font-bold.text-sm')
print(span.get_text())
prints:
Property reference: 412
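As a variation on the selector above, here is a minimal sketch that matches on fewer classes and handles the case where the span is missing from the parsed HTML (`select_one` returns `None` when nothing matches). The helper name `reference_text` and the shortened sample markup are assumptions made for illustration; they are not part of the original page.

```python
from bs4 import BeautifulSoup


def reference_text(html):
    """Return the text of the reference span, or None if it is not present.

    `reference_text` is a hypothetical helper name used only for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Match the span by its parent div and a single class instead of all five.
    span = soup.select_one("div.k-ns > span.text-gray")
    return span.get_text(strip=True) if span is not None else None


# Shortened sample markup (an assumption, cut down from the question's HTML).
sample = (
    '<div class="k-ns">'
    '<span class="text-gray block mt-3 font-bold text-sm">Property reference: 412</span>'
    "</div>"
)

print(reference_text(sample))
print(reference_text('<div class="k-ns"></div>'))
```

If this returns `None` for the HTML you downloaded, the span is probably not in the server response at all, and the selector itself is not the problem.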
If you need to work with the live page, here is one way using Selenium:
from selenium import webdriver
# Path to the geckodriver binary; adjust it to your own install location
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get('https://www.kyero.com/en/property/7689206-villa-for-sale-sant-joan-de-labritja')
span = driver.find_element_by_css_selector('span.text-gray.block.mt-3.font-bold.text-sm')
print(span.text)
driver.close()
prints:
Property reference: 412
Note that you need Selenium and geckodriver, and that in this code geckodriver is loaded from c:/program/geckodriver.exe.
@Andrej Kesely was faster with the other answer, so I am giving a Selenium answer.
Upvotes: 1
Reputation: 195438
To print the reference label, you can use this script (the data is stored in a JavaScript variable inside the HTML document):
import re
import json
import requests
url = 'https://www.kyero.com/en/property/7689206-villa-for-sale-sant-joan-de-labritja'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
html_text = requests.get(url, headers=headers).text
# The page assigns its data as JSON to window.initialState; capture it and parse it
data = json.loads(re.search(r'window\.initialState = (.*);', html_text).group(1))
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print(data['property']['referenceLabel'])
Prints:
Property reference: 412
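To see the extraction step on its own, without a network request, here is a small offline sketch using the same regular expression. The `sample_html` string and the helper name `extract_initial_state` are assumptions made for illustration; the real page contains far more data in `window.initialState`.

```python
import json
import re


def extract_initial_state(html_text):
    """Return the parsed window.initialState object, or None if it is absent.

    `extract_initial_state` is a hypothetical helper name for this sketch.
    """
    match = re.search(r'window\.initialState = (.*);', html_text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up stand-in for the downloaded page (an assumption, not the real HTML).
sample_html = '''<html><body>
<script>
window.initialState = {"property": {"referenceLabel": "Property reference: 412"}};
</script>
</body></html>'''

state = extract_initial_state(sample_html)
print(state["property"]["referenceLabel"])
```

Returning `None` when the pattern is not found avoids an `AttributeError` on `.group(1)` if the site changes how it embeds the data.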
Upvotes: 1