scrape_noob

Reputation: 53

Selecting text from soup.find

The following code in my web scraper prints the tag found for the specified class:

import requests
from bs4 import BeautifulSoup
import json
import re

name = []
address = []
price = []
date_published = []

def scrape_site(url):                          
    page = requests.get(url)
    soup = BeautifulSoup(page.text,'html.parser')
    prelim_info_string = soup.find("a", class_= "btn btn-wh-primary")    
    print(prelim_info_string)

Output:

<a class="btn btn-wh-primary" data-tealium-action="N" data-tealium-event="similar_ads_last" data-tealium-tms='{"tmsData":{"is_private":"true","ad_type":"Marktplatz","page_type":"Ad_View","region_level_label_1":"Austria","vertical_id":"5","vertical":"Marktplatz","source":"Web","ad_title":"Fahrrad+28+Zoll","num_pictures":"3","category_level_1":"Sport+%2F+Sportger%C3%A4te","region_level_id_2":"9","category_level_3":"Fahrr%C3%A4der","region_level_id_3":"117239","category_level_2":"Fahrr%C3%A4der+%2F+Radsport","seller_name":"Privat","region_level_id_1":"-141","price":"110","product_id":"67","category_level_max":"4","seller_uuid":"671ea1da-8e28-4199-bc2b-e2c91a4831c9","region_level_2":"Wien","region_level_3":"Wien%2C+17.+Bezirk%2C+Hernals","category_level_4":"Citybikes+%2F+Stadtr%C3%A4der","seller_id":"29608611","region_level_1":"AT","ad_type_id":"67","category_level_id_3":"4552","category_level_id_2":"4525","category_level_id_1":"4390","category_level_id_4":"4554","exact_price":"110.0","environment":"web","ad_id":"423313540","post_code":"1170","event_name":"adview","publish_date":"2020-11-18T19%3A05%3A00.000Z"}}' href="/iad/kaufen-und-verkaufen/aehnlichkeitssuche?adId=423313540&amp;imageId=1&amp;ATTRIBUTE_TREE=4554" id="btn-show-more">
        Mehr Anzeigen
    </a>

My goal with this web scraper is to extract the following values, append them to the lists, and then export them to Excel:
"seller_name"
"price"
"post_code"
and
"publish_date"

I've tried using regex and get_text(), but I just can't seem to get that text. All of it is inside that tag!

Any help here is appreciated, thank you!

Edit: here's the website I'm working with. Once this works, the plan is to loop through the last 100 or so ads and pull the same data from each.

https://www.willhaben.at/iad/kaufen-und-verkaufen/d/fahrrad-28-zoll-423313540/

Upvotes: 0

Views: 67

Answers (2)

wwii

Reputation: 23773

Using the tag's attrs attribute, you can drill down until you reach the data-tealium-tms value, which is a string representation of a dict that can be parsed with ast.literal_eval.

from bs4 import BeautifulSoup
import ast, operator

s = '''<html><a class="btn btn-wh-primary" data-tealium-action="N" data-tealium-event="similar_ads_last" data-tealium-tms='{"tmsData":{"is_private":"true","ad_type":"Marktplatz","page_type":"Ad_View","region_level_label_1":"Austria","vertical_id":"5","vertical":"Marktplatz","source":"Web","ad_title":"Fahrrad+28+Zoll","num_pictures":"3","category_level_1":"Sport+%2F+Sportger%C3%A4te","region_level_id_2":"9","category_level_3":"Fahrr%C3%A4der","region_level_id_3":"117239","category_level_2":"Fahrr%C3%A4der+%2F+Radsport","seller_name":"Privat","region_level_id_1":"-141","price":"110","product_id":"67","category_level_max":"4","seller_uuid":"671ea1da-8e28-4199-bc2b-e2c91a4831c9","region_level_2":"Wien","region_level_3":"Wien%2C+17.+Bezirk%2C+Hernals","category_level_4":"Citybikes+%2F+Stadtr%C3%A4der","seller_id":"29608611","region_level_1":"AT","ad_type_id":"67","category_level_id_3":"4552","category_level_id_2":"4525","category_level_id_1":"4390","category_level_id_4":"4554","exact_price":"110.0","environment":"web","ad_id":"423313540","post_code":"1170","event_name":"adview","publish_date":"2020-11-18T19%3A05%3A00.000Z"}}' href="/iad/kaufen-und-verkaufen/aehnlichkeitssuche?adId=423313540&amp;imageId=1&amp;ATTRIBUTE_TREE=4554" id="btn-show-more">
        Mehr Anzeigen
    </a></html>'''
    
important_stuff = operator.itemgetter("seller_name",
                                      "price",
                                      "post_code" ,
                                      "publish_date")

soup = BeautifulSoup(s,'html.parser')
tag = soup.find('a')

q = ast.literal_eval(tag.attrs['data-tealium-tms'])

In [43]: q['tmsData'].keys()
dict_keys(['is_private', 'ad_type', 'page_type', 'region_level_label_1', 'vertical_id', 'vertical', 'source', 'ad_title', 'num_pictures', 'category_level_1', 'region_level_id_2', 'category_level_3', 'region_level_id_3', 'category_level_2', 'seller_name', 'region_level_id_1', 'price', 'product_id', 'category_level_max', 'seller_uuid', 'region_level_2', 'region_level_3', 'category_level_4', 'seller_id', 'region_level_1', 'ad_type_id', 'category_level_id_3', 'category_level_id_2', 'category_level_id_1', 'category_level_id_4', 'exact_price', 'environment', 'ad_id', 'post_code', 'event_name', 'publish_date'])

In [44]: important_stuff(q['tmsData'])
Out[44]: ('Privat', '110', '1170', '2020-11-18T19%3A05%3A00.000Z')
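
The attribute value also happens to be valid JSON, and publish_date comes back percent-encoded (%3A instead of a colon). As a minimal sketch of a variation on the approach above, assuming the attribute is always valid JSON, json.loads can replace ast.literal_eval and urllib.parse.unquote_plus can decode the values. The html string is a shortened stand-in for the real page, and extract_fields and WANTED are hypothetical names used only for illustration.

```python
import json
from urllib.parse import unquote_plus

from bs4 import BeautifulSoup

# Shortened stand-in for the real page (assumption: only the four
# fields of interest are kept in the data-tealium-tms attribute).
html = '''<a class="btn btn-wh-primary" id="btn-show-more" data-tealium-tms='{"tmsData":{"seller_name":"Privat","price":"110","post_code":"1170","publish_date":"2020-11-18T19%3A05%3A00.000Z"}}'>
        Mehr Anzeigen
    </a>'''

# Hypothetical name: the keys the question asks for.
WANTED = ("seller_name", "price", "post_code", "publish_date")


def extract_fields(markup):
    """Return the wanted tmsData fields as a dict of decoded strings."""
    soup = BeautifulSoup(markup, "html.parser")
    # Match the first <a> tag that carries a data-tealium-tms attribute.
    tag = soup.find("a", attrs={"data-tealium-tms": True})
    if tag is None:
        return None
    # The attribute value is a JSON document with a top-level "tmsData" key.
    data = json.loads(tag["data-tealium-tms"])["tmsData"]
    # unquote_plus turns %3A into ":" and "+" into a space.
    return {key: unquote_plus(data[key]) for key in WANTED}


fields = extract_fields(html)
print(fields)
```

With this input, fields["publish_date"] is decoded to 2020-11-18T19:05:00.000Z. Inside a loop over several ads, each value could be appended to the matching list from the question, for example price.append(fields["price"]).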

Upvotes: 1

user14137201

Append .text to your variable, without parentheses, since it is a property rather than a method. So try this: print(prelim_info_string.text)
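
For comparison, here is a small sketch of what .text returns next to what the tag's attributes hold. It assumes a shortened copy of the tag from the question; the html string and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the tag from the question (assumption: one field only).
html = '''<a class="btn btn-wh-primary" data-tealium-tms='{"tmsData":{"price":"110"}}'>
        Mehr Anzeigen
    </a>'''

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

# .text is a property: it returns the text between <a> and </a>.
link_text = tag.text.strip()

# Attribute values are not part of .text; they are read by subscripting.
raw_attribute = tag["data-tealium-tms"]

print(link_text)
print(raw_attribute)
```

Here .text gives the visible link text, Mehr Anzeigen, while the seller, price, and date values stay in the data-tealium-tms attribute and have to be read from there.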

Upvotes: 0
