monkey doodle

Reputation: 700

Python: extract a specific piece of text from HTML

I am trying to extract a specific piece of text, written in Korean, from an HTML page. Is there a more efficient way of doing this than converting the element to a string and parsing it from there?

CODE:

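import urllib2
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

# input:  url
# output: the <span id="lblKName"> element (None if not found)
def urlSC(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    name = soup.find('span', id='lblKName')
    return name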

OUTPUT:

<span id="lblKName">구세군앵커리지한인교회<br>The Salvation Army Anch. Korean Corps.</br></span>

Want: 구세군앵커리지한인교회

url: http://www.koreanchurchyp.com/ViewDetail.aspx?OrgID=4102

Upvotes: 0

Views: 63

Answers (2)

Casimir et Hippolyte

Reputation: 89547

If the Korean part of the text always comes first, before the br tag, you can use:

name = soup.find(id = 'lblKName').contents[0]
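
As a minimal sketch of the same idea, the snippet below parses the span shown in the question's output instead of fetching the page, so it runs without network access. The helper name `korean_name` and the choice of the built-in `"html.parser"` are assumptions made for illustration. It uses `stripped_strings` rather than `contents[0]` so that leading whitespace inside the span does not get returned by mistake.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question's output; no network access needed.
SAMPLE_HTML = (
    '<span id="lblKName">구세군앵커리지한인교회<br>'
    'The Salvation Army Anch. Korean Corps.</br></span>'
)


def korean_name(html):
    """Return the first text node inside the lblKName span, or None.

    Hypothetical helper; assumes the Korean name is the first text
    in the span, as in the question's sample output.
    """
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find(id="lblKName")
    if span is None:
        return None
    # stripped_strings yields each text node with surrounding whitespace removed
    return next(span.stripped_strings, None)


name = korean_name(SAMPLE_HTML)
print(name)
```

If the page ever puts the English name first, this would return the English text instead, so it relies on the same ordering assumption as `contents[0]`.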

Upvotes: 2

johntellsall

Reputation: 15160

Tips:

  1. BeautifulSoup can take a file handle instead of the HTML string. This is slightly simpler, and might be faster if your text is nearer the beginning of the page.

    soup = BeautifulSoup(urllib2.urlopen(url))
    
  2. Another option is a regular expression. They're quite fast, but also a challenge to build correctly, and will break if the page format changes. Stick with BeautifulSoup unless you get stuck.

  3. BeautifulSoup can use several different parser libraries, with different space/time/reliability tradeoffs. See: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
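
For comparison, here is a rough sketch of the regular-expression option from tip 2, run against the span from the question's output. The pattern and the helper name `korean_name_regex` are assumptions for illustration; the pattern depends on the id being written exactly as `id="lblKName"` with double quotes, which is the kind of fragility mentioned above.

```python
import re

# Sample markup copied from the question's output.
SAMPLE_HTML = (
    '<span id="lblKName">구세군앵커리지한인교회<br>'
    'The Salvation Army Anch. Korean Corps.</br></span>'
)

# Capture the text between the opening span tag and the next "<".
# Assumes the attribute is written exactly as id="lblKName".
NAME_PATTERN = re.compile(r'<span[^>]*\bid="lblKName"[^>]*>\s*([^<]+)')


def korean_name_regex(html):
    """Return the first text after the lblKName span tag, or None."""
    match = NAME_PATTERN.search(html)
    return match.group(1).strip() if match else None


regex_name = korean_name_regex(SAMPLE_HTML)
print(regex_name)
```

A change as small as switching to single quotes around the id would make this return None, whereas the BeautifulSoup lookup would keep working.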

Upvotes: 0
