davidawad

Reputation: 1043

Simple Web Scraping with Python

I haven't been able to find a simple way to do this. I have been following this and have written the following:

    ## just comments before this
    from lxml import html
    import requests

    page = requests.get('https://finalexams.rutgers.edu.html')

    tree = html.fromstring(page.text)

    tableRow = tree.xpath('//tr/text()')

    print('Rows', tableRow)

That script needs to parse table rows like these and pull out the values inside them, but there could be an arbitrary number of table rows. I don't know how to access nested tags, and the rows don't have unique names or IDs for me to look for.

How can I write a for loop that gets each of these table rows and lets me grab the individual bits of them?

  <tr>
    <td> 04264</td>
    <td>01:198:205</td>
    <td>01</td>
    <td>INTR DISCRET STRCT I</td>
    <td>C</td>
    <td>Dec 17, 2014:  8:00 AM - 11:00 AM </td>
  </tr>

  <tr>
    <td> 09907</td>
    <td>01:198:214</td>
    <td>01</td>
    <td>SYSTEMS PROGRAMMING</td>
    <td>C</td>
    <td>Dec 18, 2014:  8:00 PM - 11:00 PM </td>
  </tr>

Upvotes: 0

Views: 164

Answers (2)

abarnert

Reputation: 365717

If you want the tr elements themselves rather than their (empty) text, search for the elements instead of their text nodes:

rows = tree.xpath('//tr')

And then you can iterate them:

for row in rows:

Then you can either search each row for its td elements (e.g., with row.xpath or row.findall), or just assume all of its children are td elements (as they happen to be in this case):

    for column in row:

And then you can do whatever it is you wanted to do with each column, like extract its text:

        print(column.text)
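
Put together, the fragments above might look like the following minimal sketch. The `sample_html` string and the `extract_rows` helper are assumptions made for illustration: the string stands in for `page.text` and reuses the rows from the question, and the helper name does not come from lxml. It also guards against `column.text` being `None`, which lxml returns for a cell with no text.

```python
from lxml import html

# Hypothetical stand-in for page.text, reusing the rows from the question.
sample_html = """
<table>
  <tr>
    <td> 04264</td>
    <td>01:198:205</td>
    <td>01</td>
    <td>INTR DISCRET STRCT I</td>
    <td>C</td>
    <td>Dec 17, 2014:  8:00 AM - 11:00 AM </td>
  </tr>
  <tr>
    <td> 09907</td>
    <td>01:198:214</td>
    <td>01</td>
    <td>SYSTEMS PROGRAMMING</td>
    <td>C</td>
    <td>Dec 18, 2014:  8:00 PM - 11:00 PM </td>
  </tr>
</table>
"""


def extract_rows(tree):
    """Return one list of stripped cell strings per tr element."""
    result = []
    for row in tree.xpath('//tr'):
        # Assumes every child of the row is a td, as in the sample above.
        # column.text is None for a cell without text, so fall back to ''.
        result.append([(column.text or '').strip() for column in row])
    return result


tree = html.fromstring(sample_html)
for cells in extract_rows(tree):
    print(cells)
```

Returning lists instead of printing inside the loop keeps the parsing separate from whatever you want to do with each row afterwards.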

Upvotes: 3

alecxe

Reputation: 473863

Iterate over all tr tags and, for every row, run an inner loop over its td tags. For example:

from lxml.html import fromstring

data = """
your html here
"""

root = fromstring(data)
for index, row in enumerate(root.xpath('//table/tr')):
    print("Row #%s" % index)

    for cell in row.findall('td'):
        print(cell.text.strip())

    print("----")

Prints:

Row #0
04264
01:198:205
01
INTR DISCRET STRCT I
C
Dec 17, 2014:  8:00 AM - 11:00 AM
----
Row #1
09907
01:198:214
01
SYSTEMS PROGRAMMING
C
Dec 18, 2014:  8:00 PM - 11:00 PM
----
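
If you need to pick out individual values by name rather than by position, one possible extension is to zip each row's cells with a tuple of field names. This is only a sketch: the names in `FIELDS` and the `parse_rows` helper are hypothetical, since the page itself does not say what each column means, and rows whose cell count differs from `FIELDS` are simply skipped. It uses `text_content()`, which returns an empty string rather than `None` for an empty cell.

```python
from lxml import html

# Hypothetical column names; the source page does not label its cells.
FIELDS = ("index", "course", "section", "title", "exam_code", "exam_time")


def parse_rows(page_source):
    """Return one dict per tr that has exactly len(FIELDS) td cells."""
    tree = html.fromstring(page_source)
    records = []
    for row in tree.xpath("//tr"):
        # text_content() yields '' (not None) for a cell with no text.
        cells = [td.text_content().strip() for td in row.findall("td")]
        if len(cells) != len(FIELDS):
            continue  # skip header rows or rows with an unexpected layout
        records.append(dict(zip(FIELDS, cells)))
    return records


# Hypothetical sample input: one header row plus one row from the question.
sample = """
<table>
  <tr><th>header row, skipped because it has no td cells</th></tr>
  <tr>
    <td> 09907</td>
    <td>01:198:214</td>
    <td>01</td>
    <td>SYSTEMS PROGRAMMING</td>
    <td>C</td>
    <td>Dec 18, 2014:  8:00 PM - 11:00 PM </td>
  </tr>
</table>
"""

for record in parse_rows(sample):
    print(record["course"], record["title"], record["exam_time"])
```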

Upvotes: 0
