Reputation: 33
I am trying to take the data of a given day from this timetable: click here
I have been able to use Beautiful Soup to add the entire row for a given day (in this case Monday, or 'Mon') to a list using this code:
from BeautifulSoup import BeautifulSoup
day = 'Mon'
with open('timetable.txt', 'rt') as input_file:
    html = input_file.read()
soup = BeautifulSoup(html)
# finds correct day tag
starttag = soup.find(text=day).parent.parent
print starttag
nexttag = starttag
row = []
x = 0
# puts all td tags for that day in a list
while x < 18:
    nexttag = nexttag.nextSibling.nextSibling
    row.append(nexttag)
    x += 1
print row
As you can see, the code returns a list of td tags, which make up the 'Mon' row of the timetable.
My problem is, I don't know how to further parse or search the returned list to find the relevant info (COMP1740 etc.).
If I can find out how to search each element in the list for the module codes, I can then concatenate them alongside another list of timings, giving the timetable for a single day.
All help is welcome! (including completely different approaches)
Upvotes: 2
Views: 1009
Reputation: 33
from BeautifulSoup import BeautifulSoup
import re
# day input
day = 'Thu'
# searches for a module (where html has rowspan="1")
module = re.compile(r'rowspan=\"1\"')
# lengths of module search (depending on html colspan attribute)
# 1.5 hour
perlen15 = re.compile(r'colspan=\"3\"')
# 2 hour
perlen2 = re.compile(r'colspan=\"4\"')
# 2.5 hour etc.
perlen25 = re.compile(r'colspan=\"5\"')
perlen3 = re.compile(r'colspan=\"6\"')
perlen35 = re.compile(r'colspan=\"7\"')
perlen4 = re.compile(r'colspan=\"8\"')
# times correspond to first row of timetable.
times = ['8:00', '8:30', '9:00', '9:30', '10:00', '10:30', '11:00', '11:30', '12:00', '12:30', '13:00', '13:30', '14:00', '14:30', '15:00', '15:30']
# opens full timetable html
with open('timetable.txt', 'rt') as input_file:
    html = input_file.read()
soup = BeautifulSoup(html)
# finds correct day tag
starttag = soup.find(text=day).parent.parent
nexttag = starttag
row = []
# movement of cursor iterating over times list
curmv = 0
# puts following td tags for that day in a list
for time in times:
    nexttag = nexttag.nextSibling.nextSibling
    # detect if a module is found
    found = module.search(repr(nexttag))
    # detect length of that module
    hour15 = perlen15.search(repr(nexttag))
    hour2 = perlen2.search(repr(nexttag))
    hour25 = perlen25.search(repr(nexttag))
    hour3 = perlen3.search(repr(nexttag))
    hour35 = perlen35.search(repr(nexttag))
    hour4 = perlen4.search(repr(nexttag))
    if found:
        row.append(times[curmv])
        row.append(nexttag)
        if hour15:
            curmv += 3
        elif hour2:
            curmv += 4
        elif hour25:
            curmv += 5
        elif hour3:
            curmv += 6
        elif hour35:
            curmv += 7
        elif hour4:
            curmv += 8
        else:
            curmv += 2
    else:
        curmv += 1
# write day to html file
with open('output.html', 'wt') as output_file:
    for e in row:
        output_file.write(str(e))
As you can see, the code can differentiate between one-hour and two-hour lectures, as well as 1.5-hour and 2.5-hour ones, etc.
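For what it's worth, the six separate colspan patterns could probably be collapsed into one pattern with a capture group, since the number in colspan already is the number of half-hour slots to skip. This is only a sketch: the cell strings below are made up to stand in for repr(nexttag), the names COLSPAN, slots, cursor and starts are hypothetical, and it assumes every module cell carries a colspan attribute.

```python
import re

# One pattern with a capture group instead of one pattern per lecture length.
COLSPAN = re.compile(r'colspan="(\d+)"')

def slots(cell_html):
    """Return how many half-hour slots a cell spans (1 if no colspan is given)."""
    match = COLSPAN.search(cell_html)
    return int(match.group(1)) if match else 1

# Hypothetical cell strings, standing in for repr(nexttag) in the post.
cells = [
    '<td></td>',
    '<td rowspan="1" colspan="2">COMP1740</td>',
    '<td></td>',
    '<td rowspan="1" colspan="3">COMP1550</td>',
]
times = ['8:00', '8:30', '9:00', '9:30', '10:00', '10:30', '11:00']

cursor = 0   # index into times, like curmv in the post
starts = []  # start time of every module cell found
for cell in cells:
    if 'rowspan="1"' in cell:
        starts.append(times[cursor])
    cursor += slots(cell)
print(starts)
```

Looping over the cells rather than over the times also means the loop ends by itself when the cells run out.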
My only problem now is the for loop (for time in times:): I need a better way to tell the code to stop moving horizontally across the table, i.e. to know when to end the loop. In the previous code I had while x < 18:, which only worked for Monday because there were 18 td tags in that row. How can I get the loop to stop when it hits the parent </tr> tag?
Thanks!
EDIT: I'm going to try using a try/except block to catch the error I get if I extend the 'times' list all the way up to 18:00.
EDIT2: IT WORKED! :D
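For anyone finding this later, here is roughly what the try/except idea looks like. It is a sketch, assuming the error in question is the AttributeError raised when nextSibling returns None at the end of the row. Node, link and cells_after are hypothetical stand-ins so the example runs without the timetable file; a real tag would take the place of Node.

```python
class Node(object):
    """Hypothetical stand-in for a parsed tag: only the nextSibling link is modelled."""
    def __init__(self, name):
        self.name = name
        self.nextSibling = None

def link(names):
    """Chain nodes together the way siblings inside one <tr> are chained."""
    nodes = [Node(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.nextSibling = right
    return nodes[0]

def cells_after(start):
    """Collect every second sibling after start, stopping at the end of the row."""
    row = []
    tag = start
    while True:
        try:
            # Two hops, as in the post: skip the whitespace node, land on the next td.
            tag = tag.nextSibling.nextSibling
        except AttributeError:
            break  # the first hop returned None: no siblings left
        if tag is None:
            break  # the second hop ran off the end of the row
        row.append(tag.name)
    return row

first = link(['day', 'ws', 'td1', 'ws', 'td2', 'ws'])
print(cells_after(first))
```

The explicit None check covers the case where the row ends with a whitespace node, which would not raise on its own.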
Upvotes: 0
Reputation: 86220
You can find information like the course numbers using Regular Expressions, i.e., pattern matching.
I don't know your experience with them, but Python includes a 're' module. The pattern "the four letters C-O-M-P followed by one or more digits" gives a regex of COMP\d+, where \d is one digit and the following + says to look for as many as possible (in this case, 4).
from BeautifulSoup import BeautifulSoup
import re
day = 'Mon'
codePat = re.compile(r'COMP\d+')
with open('timetable.txt', 'rt') as input_file:
    html = input_file.read()
soup = BeautifulSoup(html)
# finds correct day tag
starttag = soup.find(text=day).parent.parent
# print starttag
nexttag = starttag
row = []
x = 0
# puts all td tags for that day in a list
while x < 18:
    nexttag = nexttag.nextSibling.nextSibling
    found = codePat.search(repr(nexttag))
    if found:
        row.append(found.group(0))
    x += 1
print row
This gives me the output,
['COMP1940', 'COMP1550', 'COMP1740']
Like I said, I don't know where your knowledge of Regular Expressions is, so if you can describe the patterns I can try to write them. Here's a good resource if you decide to do it on your own.
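As a starting point, if modules from other departments show up too, the pattern could be loosened a little. This sketch assumes every module code is four capital letters followed by four digits, which I can't confirm from the timetable. The cell strings and the MATH1055 code are invented for illustration, and MODULE_CODE is just a name I picked.

```python
import re

# Assumption: a module code is four capital letters followed by four digits.
MODULE_CODE = re.compile(r'[A-Z]{4}\d{4}')

# Hypothetical cell strings, standing in for repr(nexttag).
cells = [
    '<td rowspan="1" colspan="2">COMP1940 Lecture</td>',
    '<td></td>',
    '<td rowspan="1" colspan="4">MATH1055 / COMP1550</td>',
]

# findall returns every match in a cell, so a cell listing two modules yields both.
codes = [code for cell in cells for code in MODULE_CODE.findall(cell)]
print(codes)
```

Using findall instead of search matters only when one cell can hold more than one code.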
Upvotes: 1