Reputation: 53
The following code prints the attached text from the specified class in my web scraper:
import requests
from bs4 import BeautifulSoup
import json
import re
name = []
address = []
price = []
date_published = []
def scrape_site(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    prelim_info_string = soup.find("a", class_="btn btn-wh-primary")
    print(prelim_info_string)
<a class="btn btn-wh-primary" data-tealium-action="N" data-tealium-event="similar_ads_last" data-tealium-tms='{"tmsData":{"is_private":"true","ad_type":"Marktplatz","page_type":"Ad_View","region_level_label_1":"Austria","vertical_id":"5","vertical":"Marktplatz","source":"Web","ad_title":"Fahrrad+28+Zoll","num_pictures":"3","category_level_1":"Sport+%2F+Sportger%C3%A4te","region_level_id_2":"9","category_level_3":"Fahrr%C3%A4der","region_level_id_3":"117239","category_level_2":"Fahrr%C3%A4der+%2F+Radsport","seller_name":"Privat","region_level_id_1":"-141","price":"110","product_id":"67","category_level_max":"4","seller_uuid":"671ea1da-8e28-4199-bc2b-e2c91a4831c9","region_level_2":"Wien","region_level_3":"Wien%2C+17.+Bezirk%2C+Hernals","category_level_4":"Citybikes+%2F+Stadtr%C3%A4der","seller_id":"29608611","region_level_1":"AT","ad_type_id":"67","category_level_id_3":"4552","category_level_id_2":"4525","category_level_id_1":"4390","category_level_id_4":"4554","exact_price":"110.0","environment":"web","ad_id":"423313540","post_code":"1170","event_name":"adview","publish_date":"2020-11-18T19%3A05%3A00.000Z"}}' href="/iad/kaufen-und-verkaufen/aehnlichkeitssuche?adId=423313540&imageId=1&ATTRIBUTE_TREE=4554" id="btn-show-more">
Mehr Anzeigen
</a>
My goal with this web scraper is to extract the following values, append them to the lists, and then import them into Excel: "seller_name", "price", "post_code", and "publish_date".
I've tried using regex and get_text(), but I just can't seem to get that text. All of it is in the tag with that class!
Any help here is appreciated, thank you!
edit: here's the website I'm working with. The plan is to loop through the last 100 or so ads and pull all this data once I can get it working.
https://www.willhaben.at/iad/kaufen-und-verkaufen/d/fahrrad-28-zoll-423313540/
Upvotes: 0
Views: 67
Reputation: 23773
Using the tag's attrs attribute, you can drill down until you reach a string representation of a dict, which can be parsed with ast.literal_eval.
from bs4 import BeautifulSoup
import ast, operator
s = '''<html><a class="btn btn-wh-primary" data-tealium-action="N" data-tealium-event="similar_ads_last" data-tealium-tms='{"tmsData":{"is_private":"true","ad_type":"Marktplatz","page_type":"Ad_View","region_level_label_1":"Austria","vertical_id":"5","vertical":"Marktplatz","source":"Web","ad_title":"Fahrrad+28+Zoll","num_pictures":"3","category_level_1":"Sport+%2F+Sportger%C3%A4te","region_level_id_2":"9","category_level_3":"Fahrr%C3%A4der","region_level_id_3":"117239","category_level_2":"Fahrr%C3%A4der+%2F+Radsport","seller_name":"Privat","region_level_id_1":"-141","price":"110","product_id":"67","category_level_max":"4","seller_uuid":"671ea1da-8e28-4199-bc2b-e2c91a4831c9","region_level_2":"Wien","region_level_3":"Wien%2C+17.+Bezirk%2C+Hernals","category_level_4":"Citybikes+%2F+Stadtr%C3%A4der","seller_id":"29608611","region_level_1":"AT","ad_type_id":"67","category_level_id_3":"4552","category_level_id_2":"4525","category_level_id_1":"4390","category_level_id_4":"4554","exact_price":"110.0","environment":"web","ad_id":"423313540","post_code":"1170","event_name":"adview","publish_date":"2020-11-18T19%3A05%3A00.000Z"}}' href="/iad/kaufen-und-verkaufen/aehnlichkeitssuche?adId=423313540&imageId=1&ATTRIBUTE_TREE=4554" id="btn-show-more">
Mehr Anzeigen
</a></html>'''
important_stuff = operator.itemgetter("seller_name",
                                      "price",
                                      "post_code",
                                      "publish_date")
soup = BeautifulSoup(s,'html.parser')
tag = soup.find('a')
q = ast.literal_eval(tag.attrs['data-tealium-tms'])
In [43]: q['tmsData'].keys()
dict_keys(['is_private', 'ad_type', 'page_type', 'region_level_label_1', 'vertical_id', 'vertical', 'source', 'ad_title', 'num_pictures', 'category_level_1', 'region_level_id_2', 'category_level_3', 'region_level_id_3', 'category_level_2', 'seller_name', 'region_level_id_1', 'price', 'product_id', 'category_level_max', 'seller_uuid', 'region_level_2', 'region_level_3', 'category_level_4', 'seller_id', 'region_level_1', 'ad_type_id', 'category_level_id_3', 'category_level_id_2', 'category_level_id_1', 'category_level_id_4', 'exact_price', 'environment', 'ad_id', 'post_code', 'event_name', 'publish_date'])
In [44]: important_stuff(q['tmsData'])
Out[44]: ('Privat', '110', '1170', '2020-11-18T19%3A05%3A00.000Z')
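Since the attribute value is JSON, json.loads is a possible alternative to ast.literal_eval (which would fail if the payload ever contained JSON literals such as true or null). The values are also URL-encoded, as the publish_date output above shows, so urllib.parse.unquote_plus can decode them. The following is only a sketch under those assumptions: raw is a shortened copy of the attribute string, and extract_fields and rows_to_csv are hypothetical helper names, not part of any library. Writing CSV is just one way to get the rows into Excel.

```python
import csv
import io
import json
from urllib.parse import unquote_plus

# Shortened copy of the data-tealium-tms attribute value from the question.
# In the scraper this string would come from tag.attrs['data-tealium-tms'].
raw = ('{"tmsData":{"seller_name":"Privat","price":"110",'
       '"post_code":"1170","publish_date":"2020-11-18T19%3A05%3A00.000Z"}}')

WANTED = ("seller_name", "price", "post_code", "publish_date")


def extract_fields(attr_value, keys=WANTED):
    """Parse the JSON attribute and URL-decode the selected values."""
    data = json.loads(attr_value)["tmsData"]
    return {key: unquote_plus(data[key]) for key in keys}


def rows_to_csv(rows, keys=WANTED):
    """Serialize a list of row dicts to CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=keys, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_fields(raw)
print(row)
print(rows_to_csv([row]))
```

When looping over many ads, each call to extract_fields would yield one row dict; collecting them in a single list and writing once at the end keeps the columns aligned, which four separate lists do not guarantee.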
Upvotes: 1
Reputation:
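Append .text to your variable, without the parentheses. So try this: print(prelim_info_string.text). Note that this returns only the link text ("Mehr Anzeigen"); the values asked about are stored in the data-tealium-tms attribute, not in the tag's text.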
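To make the difference between a tag's text and its attributes concrete, here is a minimal sketch using a shortened version of the tag from the question. The html string is a trimmed stand-in for the real page, so the attribute payload here is an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the <a> tag shown in the question.
html = '''<a class="btn btn-wh-primary" data-tealium-tms='{"tmsData":{"price":"110"}}' id="btn-show-more">
Mehr Anzeigen
</a>'''

tag = BeautifulSoup(html, "html.parser").find("a", class_="btn btn-wh-primary")

link_text = tag.get_text(strip=True)   # visible link text only
attr_value = tag["data-tealium-tms"]   # raw JSON string stored in the attribute

print(link_text)
print(attr_value)
```

The text accessors (.text, get_text()) never look at attributes, which would explain why they did not return the price or seller data.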
Upvotes: 0