Reputation: 21
I am having trouble parsing all the tr tags in a table. I can parse the first tr tag, but I don't understand how to parse the subsequent ones. I tried a for loop, but it didn't work. I have included only the part of the code that handles the tr tags I want to store in a JSON file.
Here is what I tried:
def parseFacultyPage(br, facultyID):
    if br is None:
        return None
    br.open('https://academics.vit.ac.in/student/stud_home.asp')
    response = br.open('https://academics.vit.ac.in/student/class_message_view.asp?sem=' + facultyID)
    html = response.read()
    soup = BeautifulSoup(html)
    tables = soup.findAll('table')
    # Extracting basic information of the faculty
    infoTable = tables[0].findAll('tr')
    name = infoTable[2].findAll('td')[0].text
    if len(name) == 0:
        return None
    subject = infoTable[2].findAll('td')[1].text
    msg = infoTable[2].findAll('td')[2].text
    sent = infoTable[2].findAll('td')[3].text
    emailmsg = 'Subject: New VIT Email' + msg
Here is sample HTML for the case where the table contains more than one tr tag:
<table width="79%" border="0" cellpadding="0" cellspacing="0" height="350">
<tr>
<td valign="top" width="1%" bgcolor=#FFFFFF>
</td>
<td valign="top" width="78%" bgcolor=#FFFFFF>
<center><b><u>VIEW CLASS MESSAGE - Winter Semester 2015~16</u></b></center>
<br><br>
<br>
<table cellpadding=4 cellspacing=2 border=0 bordercolor='black' width="100%">
<tr bgcolor=#5A768D>
<td width="25%"><font color=#FFFFFF>From</font></td>
<td width="25%"><font color=#FFFFFF>Course</font></td>
<td><font color=#FFFFFF>Message</font></td>
<td width="10%"><font color=#FFFFFF>Posted On</font></td>
</tr>
<tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'">
<td valign="top">RAGHAVAN R (SITE)</td>
<td valign="top">ITE308 - Distributed Systems - TH</td>
<td valign="top">Dear students,
As informed in the class, this is to remind you Today special class from 6 to 6.50 pm at same venue SJT 126.
regards
R. Raghavan
SITE</td>
<td valign="top">11/02/2016 11:42:57</td>
</tr>
<tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'">
<td valign="top">SMART (APT) (ACAD)</td>
<td valign="top">STS302 - Soft Skills - SS</td>
<td valign="top">Dear Students,
As 04 Feb 16 to 08 Feb 16 were announced as “No Instruction days”, the first assessment that was supposed to happen from 08 Feb 16 to 12 Feb 16 is being postponed to 7th week (15 Feb 16 to 19 Feb 16)
</td>
<td valign="top">10/02/2016 21:48:14</td>
</tr>
<tr bgcolor=#5A768D>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</table>
<br><br>
</td>
</tr>
</table>
Upvotes: 0
Views: 173
Reputation: 515
You should first iterate through the rows, as below, and in each row fetch the cells into the columns variable at the start:
for index, row in enumerate(tables[1].findAll('tr')):
    if index == 0:
        continue
    columns = row.findAll('td')
    name = columns[0].text
    if not name:
        return None
    subject = columns[1].text
    msg = columns[2].text
    sent = columns[3].text
EDIT: It looks like your HTML has two table structures, and you need the inner one. So use index 1, that is, tables[1] rather than tables[0].
I've also added enumerate around the iterator so you have the row index as well. With it, you can skip the header row when index == 0.
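If you want to try the row loop without logging in to the site, here is a minimal, dependency-free sketch. Note that it swaps BeautifulSoup for the standard library's html.parser so it runs anywhere; the loop over rows has the same shape as the findAll('tr') loop above. RowCollector, parse_messages, the SAMPLE markup (a trimmed version of the HTML in the question) and the dictionary keys are all names made up for illustration. It also assumes you want to skip the blank footer row rather than stop at it.

```python
import json
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the cell text of every <tr> in tables at one nesting depth."""

    def __init__(self, depth=2):
        super().__init__()
        self.target_depth = depth  # 2 means a table nested inside another table
        self.table_depth = 0
        self.rows = []
        self.row = None    # cells of the row being read, or None outside a row
        self.cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif self.table_depth == self.target_depth:
            if tag == "tr":
                self.row = []
            elif tag == "td" and self.row is not None:
                self.cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.table_depth -= 1
        elif self.table_depth == self.target_depth:
            if tag == "td" and self.cell is not None:
                # Join the fragments and collapse runs of whitespace.
                self.row.append(" ".join("".join(self.cell).split()))
                self.cell = None
            elif tag == "tr" and self.row is not None:
                self.rows.append(self.row)
                self.row = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def parse_messages(html):
    """Return one dict per message row of the inner table."""
    parser = RowCollector(depth=2)
    parser.feed(html)
    parser.close()
    keys = ("name", "subject", "msg", "sent")  # hypothetical key names
    messages = []
    for index, cells in enumerate(parser.rows):
        if index == 0:
            continue  # header row
        if len(cells) != len(keys) or not cells[0]:
            continue  # blank footer row or malformed row
        messages.append(dict(zip(keys, cells)))
    return messages


SAMPLE = """
<table><tr><td>
<table>
<tr><td><font>From</font></td><td><font>Course</font></td>
    <td><font>Message</font></td><td><font>Posted On</font></td></tr>
<tr><td>RAGHAVAN R (SITE)</td><td>ITE308 - Distributed Systems - TH</td>
    <td>Dear students,
special class today</td><td>11/02/2016 11:42:57</td></tr>
<tr><td>SMART (APT) (ACAD)</td><td>STS302 - Soft Skills - SS</td>
    <td>Dear Students, assessment postponed</td><td>10/02/2016 21:48:14</td></tr>
<tr><td> </td><td> </td><td> </td><td> </td></tr>
</table>
</td></tr></table>
"""

messages = parse_messages(SAMPLE)
print(json.dumps(messages, indent=2))
```

To write the result to a file instead of printing it, json.dump(messages, f) on an open file object should do. Using continue rather than return None for blank rows keeps the messages already collected.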
Upvotes: 3