Reputation: 700
I am trying to extract a specific piece of text that is written in Korean. Is there a more efficient way of doing this than converting the result to a string and parsing it from there?
CODE:
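import urllib2
from bs4 import BeautifulSoup  # assumption: the bs4 package is installed

# input: url
# output: name (the <span id="lblKName"> tag)
def urlSC(url):
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    name = soup.find('span', id='lblKName')
    return name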
OUTPUT:
<span id="lblKName">구세군앵커리지한인교회<br>The Salvation Army Anch. Korean Corps.</br></span>
Want: 구세군앵커리지한인교회
url: http://www.koreanchurchyp.com/ViewDetail.aspx?OrgID=4102
Upvotes: 0
Views: 63
Reputation: 89547
If the Korean part of the text always comes first, before the br tag, you can use:
name = soup.find(id = 'lblKName').contents[0]
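To see what that returns without hitting the site, here is a minimal offline sketch. It assumes the bs4 package with the built-in 'html.parser'; the html variable is just a stand-in holding the span from the question's output.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: the span shown in the question's output.
html = u'<span id="lblKName">구세군앵커리지한인교회<br>The Salvation Army Anch. Korean Corps.</br></span>'

soup = BeautifulSoup(html, 'html.parser')
span = soup.find(id='lblKName')

# .contents is the list of direct children of the span; the first child is
# the text node that sits before the <br> tag.
korean_name = span.contents[0].strip()
print(korean_name)
```

Note that contents[0] is a text node only as long as the Korean name really is the first child of the span, so this breaks if the page layout changes.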
Upvotes: 2
Reputation: 15160
Tips:
BeautifulSoup accepts an open file handle as well as an HTML string. Passing the handle directly is slightly simpler, and might be faster if your text is near the beginning of the page.
soup = BeautifulSoup(urllib2.urlopen(url))
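A rough way to try the file-handle form offline: the sketch below uses io.BytesIO as a stand-in for the object returned by urlopen (both have a read() method). The page bytes, the 'html.parser' choice, and the explicit from_encoding='utf-8' are assumptions for the example, not something taken from the real site.

```python
# -*- coding: utf-8 -*-
import io
from bs4 import BeautifulSoup

# Hypothetical page body, encoded as UTF-8 bytes the way a response would be.
page_bytes = (u'<html><body><span id="lblKName">구세군앵커리지한인교회'
              u'<br>The Salvation Army Anch. Korean Corps.</span>'
              u'</body></html>').encode('utf-8')

# BytesIO plays the role of the file-like object returned by urlopen(url).
handle = io.BytesIO(page_bytes)

# BeautifulSoup reads from the handle itself; no explicit .read() is needed.
soup = BeautifulSoup(handle, 'html.parser', from_encoding='utf-8')
name_from_handle = soup.find(id='lblKName').contents[0]
print(name_from_handle)
```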
Another option is a regular expression. They're quite fast, but also a challenge to build correctly, and will break if the page format changes. Stick with BeautifulSoup unless you get stuck.
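For comparison, a regex version might look like the sketch below. The pattern is an assumption based only on the span shown in the question; it depends on the exact attribute order and quoting, which is exactly the fragility mentioned above.

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for the downloaded page: the span shown in the question's output.
html = u'<span id="lblKName">구세군앵커리지한인교회<br>The Salvation Army Anch. Korean Corps.</br></span>'

# Capture everything between the opening span tag and the next '<'.
pattern = re.compile(u'<span id="lblKName">([^<]*)<')

match = pattern.search(html)
regex_name = match.group(1).strip() if match else None
print(regex_name)
```

If the site ever changes the id, adds another attribute, or switches quote style, the pattern silently stops matching and regex_name becomes None.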
BeautifulSoup can use several different parser libraries, with different space/time/reliability tradeoffs. See: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
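As a sketch of choosing a parser explicitly: the helper below (make_soup is a made-up name for this example) asks for lxml, which is an optional third-party install, and falls back to the built-in 'html.parser' when bs4 reports that lxml is not available.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the stdlib parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, 'html.parser')


# Stand-in for the downloaded page: the span shown in the question's output.
html = u'<span id="lblKName">구세군앵커리지한인교회<br>The Salvation Army Anch. Korean Corps.</br></span>'

parsed_name = make_soup(html).find(id='lblKName').contents[0]
print(parsed_name)
```

Different parsers can build slightly different trees from the same broken markup, so it is worth naming one explicitly instead of relying on whichever happens to be installed.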
Upvotes: 0