jorilallo

Reputation: 788

Encoding issue with BeautifulSoup

I'm running into an encoding issue with BeautifulSoup. I'm trying to parse Open Graph titles, but the non-ASCII characters are being dropped.

from bs4 import BeautifulSoup
doc = BeautifulSoup(html,"lxml")
doc.html.head.findAll('meta',attrs={'property':'og:title'})

For http://mattilintulahti.net/mediablogi/2013/02/11/19-asiaa-joita-et-tieda-mediayhtiosta-nimeltaan-red-bull/ it prints the following for the content attribute:

19 asiaa joita et tied mediayhtist nimeltn Red Bull

whereas the correct title is:

19 asiaa joita et tiedä mediayhtiöstä nimeltään Red Bull

Any advice on how to get UTF-8 to work properly?

Upvotes: 1

Views: 95

Answers (1)

unutbu

Reputation: 880717

I'm not able to reproduce the problem:

import urllib2  # Python 2; in Python 3 use urllib.request instead
import bs4 as bs
url = 'http://mattilintulahti.net/mediablogi/2013/02/11/19-asiaa-joita-et-tieda-mediayhtiosta-nimeltaan-red-bull/'
# read() returns the raw bytes; BeautifulSoup detects the encoding itself
html = urllib2.urlopen(url).read()
doc = bs.BeautifulSoup(html, 'lxml')
for meta in doc.html.head.findAll('meta', attrs={'property': 'og:title'}):
    print(meta.attrs['content'])

yields

19 asiaa joita et tiedä mediayhtiöstä nimeltään Red Bull
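
One possible cause, offered only as a guess since the question doesn't show how html was obtained: if the bytes get decoded as ASCII with errors ignored somewhere before they reach BeautifulSoup, every ä and ö is silently dropped, which matches the output in the question. The sketch below uses only the standard library, and the lossy decode step is a hypothetical stand-in for whatever your code actually does:

```python
# -*- coding: utf-8 -*-
expected = u"19 asiaa joita et tiedä mediayhtiöstä nimeltään Red Bull"

# Raw bytes, as a server would send them for a UTF-8 page.
raw = expected.encode("utf-8")

# Hypothetical lossy step (an assumption, not taken from the question):
# decoding as ASCII while ignoring errors discards every non-ASCII byte.
lossy = raw.decode("ascii", "ignore")

# Decoding with the page's actual encoding keeps every character.
clean = raw.decode("utf-8")

print(lossy)
print(clean)
```

If something like the lossy step is in your pipeline, passing the undecoded bytes to BeautifulSoup, as in the snippet above, should avoid it.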

If this doesn't help, please show your code.

Upvotes: 1
