Reputation: 333
I have a series of strings that are all something like "Saturday, December 27th 2014" and I want to toss the "Saturday" and save the file with the name "141227" which is year + month + day. So far, everything is working except I can't get the regex for daypos or yearpos to work. They both give the same error:
Traceback (most recent call last):
  File "scrapewaybackblog.py", line 17, in <module>
    daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
TypeError: expected a character buffer object
What's a character buffer object? Does that mean there's something wrong with my expression? Here's my script:
for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page.read())
    snippet = soup.find_all('div', attrs={'class': 'blog-box'})
    for div in snippet:
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
        monthpos = byline.find(",")
        daypos = byline.find(re.compile("[A-Z][a-z]*\s"))
        yearpos = byline.find(re.compile("[A-Z][a-z]*\D\d*\w*\s"))
        endpos = monthpos + len(byline)
        month = byline[monthpos+1:daypos]
        day = byline[daypos+0:yearpos]
        year = byline[yearpos+2:endpos]
        output_files_pathname = 'Data/'  # path where output will go
        new_filename = year + month + day + ".txt"
        outfile = open(output_files_pathname + new_filename,'w')
        outfile.write(date)
        outfile.write("\n")
        outfile.write(text)
        outfile.close()
    print "finished another url from page {}".format(i)
I also haven't figured out how to make December = 12 but that's for another time. Just please help me find the right positions.
Upvotes: 2
Views: 1108
Reputation: 473863
Instead of parsing a date string with regex, parse it with dateutil:
from dateutil.parser import parse

for div in soup.select('div.blog-box'):
    byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
    text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
    dt = parse(byline)
    # %y%m%d gives a two-digit year and zero-padded month and day, e.g. "141227"
    new_filename = "{dt:%y%m%d}.txt".format(dt=dt)
    ...
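Note that the target name from the question, "141227", needs a two-digit year and zero-padded month and day; plain attribute access such as {dt.month} does not pad single-digit values. A minimal sketch using only the standard library, where the hard-coded datetime values are stand-ins for whatever the parser returns:

```python
from datetime import datetime

# Hard-coded stand-ins for the value a date parser would return.
dt = datetime(2014, 12, 27)
early = datetime(2009, 2, 4)

# Attribute access concatenates the raw integers without padding.
unpadded = "{dt.year}{dt.month}{dt.day}".format(dt=early)

# strftime-style codes give a two-digit year and zero-padded fields.
padded = "{dt:%y%m%d}".format(dt=early)
filename = dt.strftime("%y%m%d") + ".txt"

print(unpadded)
print(padded)
print(filename)
```

With padded names the files also sort chronologically by name within a century.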
Or, you can parse the string with datetime.strptime(), but you need to take care of the suffixes:
import re
from datetime import datetime

byline = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline)
dt = datetime.strptime(byline, '%A, %B %d %Y')
Here re.sub() finds an st, nd, rd or th that directly follows a digit and replaces it with an empty string. After that, the date string matches the '%A, %B %d %Y' format.
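To illustrate the two steps together, here is a minimal sketch. The helper name byline_to_stamp is made up for this example, and it assumes an English locale, since %A and %B match locale-dependent day and month names:

```python
import re
from datetime import datetime


def byline_to_stamp(byline):
    """Turn 'Saturday, December 27th 2014' into '141227' (hypothetical helper)."""
    # Drop an ordinal suffix only when it directly follows a digit,
    # so the "st" in "August" and the "nd" in "Monday" are left alone.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)", "", byline.strip())
    dt = datetime.strptime(cleaned, "%A, %B %d %Y")
    # %y%m%d: two-digit year, zero-padded month and day.
    return dt.strftime("%y%m%d")


print(byline_to_stamp("Saturday, December 27th 2014"))
print(byline_to_stamp("Monday, August 1st 2011"))
```

Because %B parses the month name, this also covers turning "December" into 12 without a lookup table.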
Some additional notes:
- you can pass the result of urlopen() directly to the BeautifulSoup constructor
- instead of find_all() by class name, use a CSS selector: div.blog-box
- to build file paths, use os.path.join()
- use the with context manager when dealing with files

Fixed version:
import os
import urllib2

from bs4 import BeautifulSoup
from dateutil.parser import parse

for i in xrange(3, 1, -1):
    page = urllib2.urlopen("http://web.archive.org/web/20090204221349/http://www.americansforprosperity.org/nationalblog?page={}".format(i))
    soup = BeautifulSoup(page)
    for div in soup.select('div.blog-box'):
        byline = div.find('div', attrs={'class': 'date'}).text.encode('utf-8')
        text = div.find('div', attrs={'class': 'right-box'}).text.encode('utf-8')
        dt = parse(byline)
        # %y%m%d gives a two-digit year and zero-padded month and day, e.g. "141227"
        new_filename = "{dt:%y%m%d}.txt".format(dt=dt)
        with open(os.path.join('Data', new_filename), 'w') as outfile:
            outfile.write(byline)
            outfile.write("\n")
            outfile.write(text)
    print "finished another url from page {}".format(i)
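On the TypeError in the question itself: str.find() only accepts a plain string to search for, so passing it a compiled pattern fails (Python 2 words this as "expected a character buffer object"). If positions are still wanted, a compiled pattern's search() returns a match object with start() and end(). A minimal sketch, where the pattern and its group names are chosen purely for illustration:

```python
import re

byline = "Saturday, December 27th 2014"

# Hypothetical pattern: weekday, month name, day with optional suffix, year.
pattern = re.compile(
    r"(?P<weekday>[A-Z][a-z]+),\s+"
    r"(?P<month>[A-Z][a-z]+)\s+"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?\s+"
    r"(?P<year>\d{4})"
)

match = pattern.search(byline)
if match:
    # start() gives the position that find() was meant to give in the question.
    month_pos = match.start("month")
    month, day, year = match.group("month", "day", "year")
    print("{} {} {} {}".format(month_pos, month, day, year))
```

Named groups make the slicing by position unnecessary, since each piece of the date comes back directly from the match.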
Upvotes: 5