Reputation: 15
I am trying to scrape and sort articles with a body, headline, and date column. However, when pulling the date, I’m running into an error with the time zone:
ValueError: time data 'Jun 1, 2022 2:49PM EDT' does not match format '%b %d, %Y %H:%M%p %z'
My code is as follows:
def get_info(url):
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    news = soup.find('div', attrs={'class': 'body__content'}).text
    headline = soup.find('h1').text
    date = datetime.datetime.strptime(soup.find('time').text, "%b %d, %Y %H:%M%p %z")
    columns = [news, headline, date]
    column_names = ['News','Headline','Date']
    return dict(zip(column_names, columns))
Is there a way to parse the time zone in a similar way, or to just drop it altogether?
Upvotes: 1
Views: 424
Reputation: 23738
Note that %z in strptime() matches UTC offsets (e.g. -0400), not time zone names, and %Z only accepts a limited set of names (UTC, GMT, and the local machine's own time zone names). For details see the API docs.
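To make the %z behaviour concrete, here is a minimal sketch using only the standard library. It assumes the scraped string always ends in a single space-separated zone abbreviation, which is an assumption about the page and not something the question confirms. It also uses %I instead of %H, because %p only affects the parsed hour when the hour is read with %I.

```python
from datetime import datetime, timedelta

raw = "Jun 1, 2022 2:49PM EDT"

# %z matches a numeric UTC offset, so the same timestamp written with
# "-0400" in place of "EDT" parses without error.
with_offset = datetime.strptime("Jun 1, 2022 2:49PM -0400", "%b %d, %Y %I:%M%p %z")

# To drop the zone name instead, split off the last token before parsing.
# Assumption: the abbreviation is always the final space-separated token.
stamp, _, zone = raw.rpartition(" ")
naive = datetime.strptime(stamp, "%b %d, %Y %I:%M%p")

print(with_offset)
print(naive, zone)
```

The second result is a naive datetime, so it only makes sense if every article uses the same time zone.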
The simplest option is to use the dateparser module to parse dates with time zone names (e.g. EDT).
import dateparser
s = "Jun 1, 2022 2:49PM EDT"
d = dateparser.parse(s)
print(d)
Output:
2022-06-01 14:49:00-04:00
Many date modules (e.g. dateutil and pytz) have time zone offsets defined for "EST", "PST", etc., but "EDT" is less common. With these modules you need to define the time zone yourself, here as an offset of UTC-04:00 (-14400 seconds).
import dateutil.parser
s = "Jun 1, 2022 2:49PM EDT"
tzinfos = {"EDT": -14400}
d = dateutil.parser.parse(s, tzinfos=tzinfos)
print(d)
Output:
2022-06-01 14:49:00-04:00
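If you would rather not add a dependency, the same abbreviation-to-offset idea can be sketched with the standard library alone. The TZ_OFFSETS table and the parse_with_abbreviation helper are hypothetical names for illustration, and the table only covers the two abbreviations listed; extend it for whatever your source actually emits.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: abbreviation -> fixed UTC offset.
TZ_OFFSETS = {
    "EDT": timedelta(hours=-4),
    "EST": timedelta(hours=-5),
}

def parse_with_abbreviation(text):
    """Parse e.g. 'Jun 1, 2022 2:49PM EDT' into an aware datetime."""
    # Assumption: the abbreviation is the last space-separated token.
    stamp, _, abbr = text.rpartition(" ")
    naive = datetime.strptime(stamp, "%b %d, %Y %I:%M%p")
    # A KeyError here means the abbreviation is missing from TZ_OFFSETS.
    return naive.replace(tzinfo=timezone(TZ_OFFSETS[abbr], abbr))

d = parse_with_abbreviation("Jun 1, 2022 2:49PM EDT")
print(d)
```

This attaches a fixed offset, not a full time zone with daylight saving rules, which is the same trade-off as the tzinfos dictionary above.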
Upvotes: 2
Reputation: 72
As an alternative to @CodeMonkey's solution, you can also try this with pandas:
import pandas as pd
pd.to_datetime('Jun 1, 2022 2:49PM EDT').tz_localize('US/Eastern')
Upvotes: 0