Reputation: 181
I'm attempting to create a dictionary of hours based off of this calendar: http://disneyworld.disney.go.com/parks/magic-kingdom/calendar/
<td class="first"> <div class="dayContainer">
<a href="/parks/magic-kingdom/calendardayview/?asmbly_day=20120401">
<p class="day"> 1
</p> <p class="moreLink">Park Hours<br />8:00 AM - 12:00 AM<br /><br/>Extra Magic Hours<br />7:00 AM - 8:00 AM<br /><br/>Extra Magic Hours<br />12:00 AM - 3:00 AM<br /><br/>
</p>
</a>
</div>
</td>
Each of the calendar entries is on a single line, so I figured it would be best to go through the HTML line by line and, if a line contains hours, add those hours to a dictionary for the corresponding date (some days have multiple hour entries).
import urllib
import re

source = urllib.urlopen('http://disneyworld.disney.go.com/parks/magic-kingdom/calendar/')
page = source.read()
prkhrs = {}

def main():
    parsehours()

def parsehours():
    #look for #:## AM - #:## PM
    date = r'201204\d{02}'
    hours = r'\d:0{2}\s\w{2}\s-\s\d:0{2}\s\w{2}'
    #go through page line by line
    for line in page:
        times = re.findall(hours, line)
        dates = re.search(date, line)
        if dates:
            start = dates.start()
            end = dates.end()
            curdate = line[start:end]
        #if #:## - #:## is found, a date has been found
        if times:
            #create dictionary from date, stores hours in variable
            #extra magic hours(emh) are stored in same format.
            #if entry has 2/3 hour listings, those listings are emh
            prkhrs[curdate]['hours'] = times
    #just print hours for now. will change later
    print prkhrs
The problem I encounter is that when I put 'print line' inside the for loop that goes through the page, it prints the page one character at a time, which I'm assuming is what's messing things up.
Right now, the 'print prkhrs' just prints nothing, but using re.findall for both the dates and the hours prints out the correct times, so I know the regex works. Any suggestions on how I can get it to work?
Upvotes: 0
Views: 3216
Reputation: 10366
Change page = source.read() to page = source.readlines()
source.read() returns the whole page as one big string. Iterating over a string (as when you do for line in page) returns one character at a time. Just because your variables are called line and page doesn't mean Python knows what you want.
source.readlines() returns a list of strings, each of which is a line from the page.
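To see the difference without hitting the network, here is a small sketch. It assumes Python 3 (so print is a function) and uses io.StringIO as a stand-in for the object urlopen returns; the two lines of HTML are made up to resemble the markup in the question, and the names html, chars, lines and found are just for illustration.

```python
import io
import re

# Made-up HTML resembling the calendar markup; io.StringIO stands in
# for the file-like object that urlopen() would return.
html = (
    '<a href="/parks/magic-kingdom/calendardayview/?asmbly_day=20120401">\n'
    '<p class="moreLink">Park Hours<br />8:00 AM - 9:00 PM<br />\n'
)

# read() returns one string, so iterating yields single characters.
chars = [item for item in io.StringIO(html).read()]
print(chars[:3])   # ['<', 'a', ' ']

# readlines() returns a list of lines, so iterating yields whole lines.
lines = io.StringIO(html).readlines()
print(len(lines))  # 2

# With whole lines, a date pattern like the one in the question can match.
date = re.compile(r'201204\d{2}')
found = [m.group() for m in map(date.search, lines) if m]
print(found)       # ['20120401']
```

Applied to a single character, the same pattern can never match, which fits the empty output described in the question.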
Upvotes: 6