rahul kapoor

Reputation: 21

How to parse more than one tr tag in Python

I am having trouble parsing all the tr tags in a table. I can parse the first tr tag, but I don't understand how to parse the subsequent ones. I thought of using a for loop, but it didn't work. I have included only the part of the code that deals with the tr tags I want to store in a JSON file.

Here is what I tried:

def parseFacultyPage(br, facultyID):
    if br is None:
        return None

    br.open('https://academics.vit.ac.in/student/stud_home.asp')
    response = br.open('https://academics.vit.ac.in/student/class_message_view.asp?sem=' + facultyID)
    html = response.read()
    soup = BeautifulSoup(html)
    tables = soup.findAll('table')

    # Extracting basic information of the faculty
    infoTable = tables[0].findAll('tr')
    name = infoTable[2].findAll('td')[0].text
    if len(name) == 0:
        return None
    subject = infoTable[2].findAll('td')[1].text
    msg = infoTable[2].findAll('td')[2].text
    sent = infoTable[2].findAll('td')[3].text
    emailmsg = 'Subject: New VIT Email' + msg

Here is sample HTML for a page where the table contains more than one tr tag:

<table width="79%" border="0" cellpadding="0" cellspacing="0" height="350">
  <tr>
    <td valign="top" width="1%" bgcolor=#FFFFFF>
        &nbsp;
    </td>
    <td valign="top" width="78%" bgcolor=#FFFFFF>



    <center><b><u>VIEW CLASS MESSAGE - Winter Semester 2015~16</u></b></center>
    <br><br>


        <br>
        <table cellpadding=4 cellspacing=2 border=0 bordercolor='black' width="100%">

        <tr bgcolor=#5A768D>
            <td width="25%"><font color=#FFFFFF>From</font></td>
            <td width="25%"><font color=#FFFFFF>Course</font></td>
            <td><font color=#FFFFFF>Message</font></td>
            <td width="10%"><font color=#FFFFFF>Posted On</font></td>
        </tr>

            <tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'">
                <td valign="top">RAGHAVAN R (SITE)</td>
                <td valign="top">ITE308 - Distributed Systems - TH</td>
                <td valign="top">Dear students,

As informed in the class, this is to remind you Today special class from 6 to 6.50 pm at same venue SJT 126.

regards

R. Raghavan
SITE</td>
                <td valign="top">11/02/2016 11:42:57</td>
            </tr>

            <tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'">
                <td valign="top">SMART (APT) (ACAD)</td>
                <td valign="top">STS302 - Soft Skills - SS</td>
                <td valign="top">Dear Students,

As  04 Feb 16 to 08 Feb 16 were announced as “No Instruction days”, the first assessment that was supposed to happen from 08 Feb 16 to 12 Feb 16 is being postponed to 7th week (15 Feb 16 to 19 Feb 16)
</td>
                <td valign="top">10/02/2016 21:48:14</td>
            </tr>

        <tr bgcolor=#5A768D>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>

        </table>


    <br><br>
    </td>
  </tr>
</table>

Upvotes: 0

Views: 173

Answers (1)

Obsidian

Reputation: 515

You should first iterate through the rows as shown below and, at the start of each iteration, collect that row's cells into the columns variable:

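# tables[1] is the inner table that holds the message rows
for index, row in enumerate(tables[1].findAll('tr')):
    # Skip the header row (From / Course / Message / Posted On)
    if index == 0:
        continue

    # Collect the cells of the current row once, then read each field
    columns = row.findAll('td')
    name = columns[0].text
    if not name:
        return None
    subject = columns[1].text
    msg = columns[2].text
    sent = columns[3].text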

EDIT: It looks like your HTML has two nested tables, and you need the inner one. So use index 1, that is tables[1], instead of tables[0].

I've also wrapped the iterator in enumerate so you have the row index as well. With it, you can skip the header row, where index == 0.
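
Since the goal is to store every message in a JSON file, here is a minimal, self-contained sketch of the same loop that collects each row into a list instead of overwriting the variables on every pass. It is only an illustration under some assumptions: it uses the bs4 package with the built-in "html.parser" (find_all is the bs4 spelling of findAll), the names parse_messages and SAMPLE_HTML and the dictionary keys are made up for this example, and the sample markup is a trimmed version of the HTML in the question with the message bodies shortened.

```python
import json

from bs4 import BeautifulSoup

# Trimmed version of the markup from the question (message bodies shortened).
SAMPLE_HTML = """
<table>
  <tr><td>
    <table>
      <tr bgcolor="#5A768D">
        <td>From</td><td>Course</td><td>Message</td><td>Posted On</td>
      </tr>
      <tr>
        <td>RAGHAVAN R (SITE)</td>
        <td>ITE308 - Distributed Systems - TH</td>
        <td>Dear students, ...</td>
        <td>11/02/2016 11:42:57</td>
      </tr>
      <tr>
        <td>SMART (APT) (ACAD)</td>
        <td>STS302 - Soft Skills - SS</td>
        <td>Dear Students, ...</td>
        <td>10/02/2016 21:48:14</td>
      </tr>
      <tr bgcolor="#5A768D">
        <td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td>
      </tr>
    </table>
  </td></tr>
</table>
"""


def parse_messages(html):
    """Return one dict per message row of the inner table (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")

    messages = []
    # tables[1] is the inner table; [1:] skips its header row.
    for row in tables[1].find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Skip rows that are incomplete or blank, such as the &nbsp; footer row.
        if len(cells) < 4 or not cells[0]:
            continue
        messages.append({
            "from": cells[0],
            "course": cells[1],
            "message": cells[2],
            "posted_on": cells[3],
        })
    return messages


if __name__ == "__main__":
    # Print the collected rows as JSON; json.dump() would write them to a file.
    print(json.dumps(parse_messages(SAMPLE_HTML), indent=2))
```

Skipping blank rows with continue, rather than returning, means the footer row does not cut the result short. In the real function you would pass the html read from the response instead of SAMPLE_HTML.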

Upvotes: 3
