Reputation: 41
I've managed to expose the right data (some of it is calculated on the fly in the page, so it was a bit more complex than I thought), but I now need to get it into a JSON string, and despite many attempts I'm stuck!
My Python script (using Selenium and BeautifulSoup) is as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
from dateutil import parser
import json

url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path=r'C:/Users/user/Downloads/chromedriver.exe')
browser.get(url)

# The dates are rendered client-side, so grab the DOM after the page has run its JavaScript
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class": "date_display"})

for item in data:
    bin_colour = item.find('h3').text
    bin_date = parser.parse(item.find('p').text).strftime('%Y-%m-%d')
    print(bin_colour)
    print(bin_date)
    print()
browser.quit()
This results in:
Grey Bin
2021-06-30
Green Bin
2021-06-23
Clear Sack
2021-06-23
Food Bin
2021-06-23
It's probably not the best code/approach, so I'm open to your suggestions. The main goal is to end up with:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
Hope this makes sense. I've tried various ways of getting the data into the right format, but I just seem to lose it all, so after many hours of trying I'm hoping you can help.
Update: Both of MendelG's solutions worked perfectly. Vitalis's solution gave four outputs, the last being the required output, so thank you both for very quick and working solutions. I was close, but couldn't see the wood for the trees!
Upvotes: 4
Views: 157
Reputation: 4212
You can create an empty dictionary, add the values to it, and print it.
Solution
from bs4 import BeautifulSoup
from selenium import webdriver
from dateutil import parser
import json

url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.get(url)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class": "date_display"})

result = {}
for item in data:
    bin_colour = item.find('h3').text
    bin_date = parser.parse(item.find('p').text).strftime('%Y-%m-%d')
    result[bin_colour] = bin_date

print(result)
OUTPUT
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
If you need the output in a list instead, you can build it in a similar way, but you'll need to .append the values, as I did here: Trouble retrieving elements and looping pages using next page button.
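A minimal sketch of that list-based alternative, using hard-coded (colour, raw date) pairs to stand in for the scraped values, so it runs without Selenium:

```python
from datetime import datetime

# Stand-in for the values extracted from the page; the real pairs
# come from the BeautifulSoup loop in the solution above.
scraped = [
    ("Grey Bin", "30 June 2021"),
    ("Green Bin", "23 June 2021"),
]

result = []  # a list of (colour, date) tuples instead of a dict
for bin_colour, raw_date in scraped:
    bin_date = datetime.strptime(raw_date, "%d %B %Y").strftime("%Y-%m-%d")
    result.append((bin_colour, bin_date))

print(result)  # [('Grey Bin', '2021-06-30'), ('Green Bin', '2021-06-23')]
```

A list preserves duplicates and ordering of appends, whereas the dict keys each colour uniquely.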
If you need double quotes (i.e. valid JSON), use json.dumps in the print:
print(json.dumps(result))
It will print:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
Upvotes: 1
Reputation: 84465
You could collect all the listed dates using requests and re. You regex out the various JavaScript objects containing the dates for each collection type. You then need to add 1 to each month value to get months in the range 1-12 (JavaScript Date months are 0-based), which can be done with regex named groups. These can be converted to actual dates for later filtering.
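As a minimal illustration of that month fix, assuming the page embeds values like Date(2021,5,23) with a 0-based month (the full script below handles the real page source):

```python
import re
from datetime import date

# A hypothetical captured "year,month,day" string from a JavaScript
# Date(2021,5,23) literal; the month 5 really means June.
raw = "2021,5,23"

m = re.match(r'(?P<year>\d+),(?P<month>\d+),(?P<day>\d+)$', raw)
# Add 1 to the 0-based month to land in Python's 1-12 range
fixed = date(int(m.group('year')), int(m.group('month')) + 1, int(m.group('day')))

print(fixed)  # 2021-06-23
```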
Initially storing all dates in a dictionary, with the collection type as key and a list of collection dates as values, you can use zip_longest to create a DataFrame. You can then use filtering to find the next collection date for a given collection.
I use a couple of helper functions to achieve this.
import re
import requests
from dateutil import parser
from datetime import datetime
from pandas import to_datetime, DataFrame
from itertools import zip_longest

def get_dates(dates):
    # Pull out each "year,month,day" triple and bump the 0-based month by 1
    dates = [re.sub(r'(?P<g1>\d+),(?P<g2>\d+),(?P<g3>\d+)$',
                    lambda d: parser.parse('-'.join([d.group('g1'), str(int(d.group('g2')) + 1), d.group('g3')])).strftime('%Y-%m-%d'),
                    i)
             for i in re.findall(r'Date\((\d{4},\d{1,2},\d{1,2}),', dates)]
    dates = [datetime.strptime(i, '%Y-%m-%d').date() for i in dates]
    return dates

def get_next_collection(collection, df):
    # First collection date on or after today
    return df[df[collection] >= to_datetime('today')][collection].iloc[0]

collection_types = ['grey', 'green', 'clear', 'food']
r = requests.get('https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1')

collections = {}
for collection in collection_types:
    dates = re.search(r'var {0}(?:(?:bin)|(?:sack)) = (\[.*?\])'.format(collection), r.text, re.S).group(1)
    collections[collection] = get_dates(dates)

df = DataFrame(zip_longest(collections['grey'], collections['green'],
                           collections['clear'], collections['food']),
               columns=collection_types)

get_next_collection('grey', df)
You could also use a generator and islice, as detailed by @Martijn Pieters, to work directly off the dictionary entries (holding the collection dates) and limit how many future dates you are interested in, e.g.
filtered = (i for i in collections['grey'] if i >= date.today())
list(islice(filtered, 3))
The altered import lines are:
from itertools import zip_longest, islice
from datetime import datetime, date
You then don't need the pandas imports or the creation of a DataFrame.
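A self-contained sketch of that generator/islice filtering, with made-up dates and a fixed "today" standing in for the scraped collections dict so the result is reproducible:

```python
from itertools import islice
from datetime import date

# Made-up collection dates standing in for the scraped `collections` dict
collections = {'grey': [date(2021, 6, 16), date(2021, 6, 30), date(2021, 7, 14)]}

today = date(2021, 6, 23)  # fixed date for the example; use date.today() in practice

# Lazily filter to dates on or after today, then take at most the next 3
filtered = (i for i in collections['grey'] if i >= today)
upcoming = list(islice(filtered, 3))

print(upcoming)  # [datetime.date(2021, 6, 30), datetime.date(2021, 7, 14)]
```

islice simply stops early when fewer matching dates remain, so asking for 3 here safely yields 2.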
Upvotes: 0
Reputation: 20088
To get the data in a dictionary format, you can try:
out = {}
for data in tag:  # tag = soup.find_all("div", {"class": "date_display"}) from your script
    out[data.find("h3").text] = parser.parse(data.find("p").text).strftime("%Y-%m-%d")
print(out)
Or, use a dictionary comprehension:
print(
    {
        data.find("h3").text: parser.parse(data.find("p").text).strftime("%Y-%m-%d")
        for data in tag
    }
)
Output:
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
Upvotes: 2