Alex

Reputation: 347

What is this function doing in Python involving urllib2 and BeautifulSoup?

I asked a question earlier about retrieving high scores from an HTML page, and another user gave me the following code to help. I am new to Python and BeautifulSoup, so I'm trying to go through other people's code piece by piece. I understand most of it, but I don't get what this piece of code is or what it does:

    def parse_string(el):
       text = ''.join(el.findAll(text=True))
       return text.strip()

Here is the entire code:

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup
    import sys

    URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

    # Grab page html, create BeautifulSoup object
    html = urlopen(URL).read()
    soup = BeautifulSoup(html)

    # Grab the <table id="mini_player"> element
    scores = soup.find('table', {'id':'mini_player'})

    # Get a list of all the <tr>s in the table, skip the header row
    rows = scores.findAll('tr')[1:]

    # Helper function to return concatenation of all character data in an element
    def parse_string(el):
       text = ''.join(el.findAll(text=True))
       return text.strip()

    for row in rows:

       # Get all the text from the <td>s
       data = map(parse_string, row.findAll('td'))

       # Skip the first td, which is an image
       data = data[1:]

       # Do something with the data...
       print data

Upvotes: 1

Views: 439

Answers (1)

Eli Courtwright

Reputation: 192981

el.findAll(text=True) returns all the text contained within an element and its sub-elements, as a list of strings. By text I mean everything that is not part of a tag: in <b>hello</b>, "hello" is the text, but <b> and </b> are not.

That function therefore joins together all the text found beneath the given element and strips whitespace from the front and back.
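
To see the effect without fetching a page, here is a rough sketch that mimics the same three steps (collect the text pieces, join them, strip the ends). It does not use BeautifulSoup: it swaps in Python 3's standard-library html.parser as a stand-in, and the names TextCollector and parse_string_sketch, as well as the sample cell, are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every run of character data, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.chunks.append(data)


def parse_string_sketch(html_fragment):
    # Hypothetical stand-in for parse_string(el): gather the text pieces,
    # join them into one string, then strip leading/trailing whitespace
    collector = TextCollector()
    collector.feed(html_fragment)
    collector.close()
    return ''.join(collector.chunks).strip()


# Made-up table cell with nested tags and stray whitespace
cell = '<td> <a href="#"><b>Attack</b></a> level\n</td>'
print(repr(parse_string_sketch(cell)))  # 'Attack level'
```

The nested <a> and <b> tags disappear, the text inside them is kept, and only the whitespace at the very start and end is removed; the space in the middle stays.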

Here's a link to the findAll documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-text

Upvotes: 3
