Reputation: 19
I am very new to Python programming. Emphasis on VERY. I am trying to set up my first web scraping project (for news article curation).
I have already managed to scrape the news site and to create a loop that organizes the results how I want them. My issue is that I plan on scraping the web page once a day, but only for the publications that were published that same day. I don't want all of them because that would mean I would get a lot of repetition.
I know that it has something to do with converting the date via the datetime module (with an if statement), but for the life of me I couldn't find a way to make it work.
In the HTML, this is an example of how the date is displayed:
<time datetime="2019-02-24T10:30:46+00:00">Feb 24, 2019 at 10:30</time>
This is what I have so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from datetime import datetime
my_url = "https://www.coindesk.com/category/business-news/legal"
# Opening up the website, grabbing the page
uFeedOne = uReq(my_url, timeout=5)
page_one = uFeedOne.read()
uFeedOne.close()
# html parser
page_soup1 = soup(page_one, "html.parser")
# grabs each publication block
containers = page_soup1.findAll("a", {"class": "stream-article"} )
for container in containers:
    link = container.attrs['href']
    publication_date = "published on " + container.time.text
    title = container.h3.text
    description = "(CoinDesk)-- " + container.p.text
    print("link: " + link)
    print("publication_date: " + publication_date)
    print("title: " + title)
    print("description: " + description)
Upvotes: 2
Views: 6360
Reputation: 2073
Your time tag has a datetime attribute that gives a much better datetime representation than the text. Use that.
You can use the dateutil package to parse the string. Here is some sample code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from datetime import datetime, timedelta
from dateutil import parser
import pytz
my_url = "https://www.coindesk.com/category/business-news/legal"
# Opening up the website, grabbing the page
uFeedOne = uReq(my_url, timeout=5)
page_one = uFeedOne.read()
uFeedOne.close()
# html parser
page_soup1 = soup(page_one, "html.parser")
# grabs each publication block
containers = page_soup1.findAll("a", {"class": "stream-article"} )
for container in containers:
    ## get the cutoff date as a timezone-aware UTC datetime.
    ## I have taken an offset as the site has older articles than today.
    today = datetime.now(pytz.UTC) - timedelta(days=5)
    link = container.attrs['href']
    ## The actual datetime string is in the datetime attribute of the time tag.
    date_time = container.time['datetime']
    ## we will use the dateutil package to parse the ISO-formatted date.
    date = parser.parse(date_time)
    ## The parsed date carries a UTC offset, and the cutoff above is also
    ## timezone-aware, so the two can be compared directly
    ## (comparing an aware datetime with a "naive" one raises a TypeError).
    ## simple comparison
    if date >= today:
        print("article date", date)
        print("cutoff", today, "\n")
        publication_date = "published on " + container.time.text
        title = container.h3.text
        description = "(CoinDesk)-- " + container.p.text
        print("link: " + link)
        print("publication_date: " + publication_date)
        print("title: " + title)
        print("description: " + description)
Upvotes: 2
Reputation: 308
The date is represented in the ISO 8601 format. Extract the datetime attribute as a string from the time tag. If you are using Python 3.7 or later, you can use the datetime.datetime.fromisoformat method to convert this to a datetime object and then do the comparison. If you are using an older version of Python, I think the easiest approach would be to see this question and the first answer provided.
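A minimal sketch of that comparison, assuming Python 3.7+ and a timestamp string shaped like the one in the question. The helper name published_on is hypothetical, and comparing dates in UTC is an assumption about what "that same day" should mean for the scraper:

```python
from datetime import datetime, timezone


def published_on(iso_string, day):
    """Return True if the ISO 8601 timestamp falls on `day` (a date) in UTC.

    `published_on` is a hypothetical helper name. If the string has no
    UTC offset, astimezone() treats it as local time.
    """
    published = datetime.fromisoformat(iso_string)
    return published.astimezone(timezone.utc).date() == day


# Today's date in UTC, computed once before looping over the articles.
today_utc = datetime.now(timezone.utc).date()

# Value of the datetime attribute from the question's <time> tag.
sample = "2019-02-24T10:30:46+00:00"

if published_on(sample, today_utc):
    print("published today:", sample)
else:
    print("older article, skipping:", sample)
```

Inside the scraping loop, the string would come from container.time['datetime'] instead of the hard-coded sample.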
Upvotes: 0