Reputation: 41
I've managed to expose the right data (some of it is calculated on the fly in the page, so it was a bit more complex than I thought), but I now need to get it into a JSON string, and despite many attempts I'm stuck!
My Python script (using Selenium and BeautifulSoup) is as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
from dateutil import parser
import json

url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path=r'C:/Users/user/Downloads/chromedriver.exe')
browser.get(url)

# The dates are rendered client-side, so grab the DOM after the page has run its JavaScript
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class": "date_display"})

for item in data:
    bin_colour = item.find('h3').text
    bin_date = parser.parse(item.find('p').text).strftime('%Y-%m-%d')
    print(bin_colour)
    print(bin_date)
    print()
browser.quit()
This results in:
Grey Bin
2021-06-30
Green Bin
2021-06-23
Clear Sack
2021-06-23
Food Bin
2021-06-23
It's probably not the best code/approach, so I'm open to your suggestions. The main goal is to end up with:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
Hope this makes sense. I've tried various ways of getting the data into the right format, but I just seem to lose it all, so after many hours of trying I'm hoping you can help.
Update: Both of MendelG's solutions worked perfectly. Vitalis's solution gave four outputs, the last being the required output, so thank you both for very quick and working solutions. I was close, but couldn't see the wood for the trees!
Upvotes: 4
Views: 157
Reputation: 4212
You can create an empty dictionary, add the values to it, and print it.
Solution
from bs4 import BeautifulSoup
from selenium import webdriver
from dateutil import parser
import json

url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.get(url)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class": "date_display"})

result = {}
for item in data:
    bin_colour = item.find('h3').text
    bin_date = parser.parse(item.find('p').text).strftime('%Y-%m-%d')
    result[bin_colour] = bin_date

print(result)
OUTPUT
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
If you need the output in a list instead, you can build it in a similar way, but you'll need to .append the values, as I did here: Trouble retrieving elements and looping pages using next page button.
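A minimal sketch of that list-based alternative, using hard-coded (colour, raw date) pairs to stand in for the scraped values, so it runs without Selenium:

```python
from datetime import datetime

# Stand-in for the values extracted from the page; the real pairs
# come from the BeautifulSoup loop in the solution above.
scraped = [
    ("Grey Bin", "30 June 2021"),
    ("Green Bin", "23 June 2021"),
]

result = []  # a list of (colour, date) tuples instead of a dict
for bin_colour, raw_date in scraped:
    bin_date = datetime.strptime(raw_date, "%d %B %Y").strftime("%Y-%m-%d")
    result.append((bin_colour, bin_date))

print(result)  # [('Grey Bin', '2021-06-30'), ('Green Bin', '2021-06-23')]
```

A list preserves duplicates and ordering of appends, whereas the dict keys each colour uniquely.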
If you need double quotes (i.e. valid JSON), use json.dumps in the print:
print(json.dumps(result))
It will print:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
Upvotes: 1
Reputation: 84465
You could collect all the listed dates using requests and re. You regex out the various JavaScript objects containing the dates for each collection type. You then need to add 1 to each month value to get months in the range 1-12 (JavaScript Date months are 0-based), which can be done with regex named groups. These can be converted to actual dates for later filtering.
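As a minimal illustration of that month fix, assuming the page embeds values like Date(2021,5,23) with a 0-based month (the full script below handles the real page source):

```python
import re
from datetime import date

# A hypothetical captured "year,month,day" string from a JavaScript
# Date(2021,5,23) literal; the month 5 really means June.
raw = "2021,5,23"

m = re.match(r'(?P<year>\d+),(?P<month>\d+),(?P<day>\d+)$', raw)
# Add 1 to the 0-based month to land in Python's 1-12 range
fixed = date(int(m.group('year')), int(m.group('month')) + 1, int(m.group('day')))

print(fixed)  # 2021-06-23
```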
Initially storing all dates in a dictionary, with the collection type as key and a list of collection dates as values, you can use zip_longest to create a DataFrame. You can then use filtering to find the next collection date for a given collection.
I use a couple of helper functions to achieve this.
import re
import requests
from dateutil import parser
from datetime import datetime
from pandas import to_datetime, DataFrame
from itertools import zip_longest

def get_dates(dates):
    # Pull out each "year,month,day" triple and bump the 0-based month by 1
    dates = [re.sub(r'(?P<g1>\d+),(?P<g2>\d+),(?P<g3>\d+)$',
                    lambda d: parser.parse('-'.join([d.group('g1'), str(int(d.group('g2')) + 1), d.group('g3')])).strftime('%Y-%m-%d'),
                    i)
             for i in re.findall(r'Date\((\d{4},\d{1,2},\d{1,2}),', dates)]
    dates = [datetime.strptime(i, '%Y-%m-%d').date() for i in dates]
    return dates

def get_next_collection(collection, df):
    # First collection date on or after today
    return df[df[collection] >= to_datetime('today')][collection].iloc[0]

collection_types = ['grey', 'green', 'clear', 'food']
r = requests.get('https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1')

collections = {}
for collection in collection_types:
    dates = re.search(r'var {0}(?:(?:bin)|(?:sack)) = (\[.*?\])'.format(collection), r.text, re.S).group(1)
    collections[collection] = get_dates(dates)

df = DataFrame(zip_longest(collections['grey'], collections['green'],
                           collections['clear'], collections['food']),
               columns=collection_types)

get_next_collection('grey', df)
You could also use a generator and islice, as detailed by @Martijn Pieters, to work directly off the dictionary entries (holding the collection dates) and limit how many future dates you are interested in, e.g.
filtered = (i for i in collections['grey'] if i >= date.today())
list(islice(filtered, 3))
The altered import lines are:
from itertools import zip_longest, islice
from datetime import datetime, date
You then don't need the pandas imports or the creation of a DataFrame.
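A self-contained sketch of that generator/islice filtering, with made-up dates and a fixed "today" standing in for the scraped collections dict so the result is reproducible:

```python
from itertools import islice
from datetime import date

# Made-up collection dates standing in for the scraped `collections` dict
collections = {'grey': [date(2021, 6, 16), date(2021, 6, 30), date(2021, 7, 14)]}

today = date(2021, 6, 23)  # fixed date for the example; use date.today() in practice

# Lazily filter to dates on or after today, then take at most the next 3
filtered = (i for i in collections['grey'] if i >= today)
upcoming = list(islice(filtered, 3))

print(upcoming)  # [datetime.date(2021, 6, 30), datetime.date(2021, 7, 14)]
```

islice simply stops early when fewer matching dates remain, so asking for 3 here safely yields 2.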
Upvotes: 0
Reputation: 20088
To get the data in a dictionary format, you can try:
out = {}
for data in tag:  # tag = soup.find_all("div", {"class": "date_display"}) from your script
    out[data.find("h3").text] = parser.parse(data.find("p").text).strftime("%Y-%m-%d")
print(out)
Or, use a dictionary comprehension:
print(
    {
        data.find("h3").text: parser.parse(data.find("p").text).strftime("%Y-%m-%d")
        for data in tag
    }
)
Output:
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
Upvotes: 2