Reputation: 20409
Here is the string:
u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432 \u0421\u0435\u0440\u0433\u0435\u0439 \u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447'
If I call .split() on it, it doesn't work: only one part is returned. What could be wrong here?
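Update: full code:
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# Fetch the page and parse it as UTF-8
page = urllib.urlopen('http://www.rea.ru/Main.aspx?page=Krasil_nikov_Sergejj_Aleksandrovich')
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
# Take the title text and try to split it into name parts
full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip().split()
self.response.out.write(str(full_name) + '<br>')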
Upvotes: 2
Views: 2249
Reputation: 2681
With Python 3, to remove a &nbsp; (the character '\xa0'):
text = TEXT_WITH_NBSP.replace('\xa0','')
print(text)
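Note that replacing '\xa0' with an empty string drops the separator, so the words run together. If the goal is to split the name, replacing it with a regular space may be closer to what is wanted. A minimal sketch in Python 3, where the sample value of TEXT_WITH_NBSP is an assumption made up for illustration:

```python
# Hypothetical sample text: the words are separated by U+00A0 (non-breaking space).
TEXT_WITH_NBSP = 'Красильников\xa0Сергей\xa0Александрович'

# Replacing with '' removes the separator, so the words are joined.
removed = TEXT_WITH_NBSP.replace('\xa0', '')

# Replacing with ' ' keeps a separator that .split(' ') can use.
spaced = TEXT_WITH_NBSP.replace('\xa0', ' ')
parts = spaced.split(' ')

print(removed)
print(parts)
```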
Upvotes: 1
Reputation: 353159
Ah. See, the key was in information that you didn't post until requested. Your string isn't what it looks like:
[u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432&nbsp;\u0421\u0435\u0440\u0433\u0435\u0439&nbsp;\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447']
where instead of spaces, it's "&nbsp;", which is the non-breaking space character. There are several Stack Overflow questions about the best way to remove these; I don't know enough to know which one is best.
[IOW, search for "BeautifulSoup nbsp".]
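One possible approach, sketched here for Python 3 rather than the Python 2 / BeautifulSoup 3 setup in the question, is to decode the entity with the standard library's html.unescape and then split on whitespace. The sample string is an assumption standing in for the text extracted from the page:

```python
import html

# Hypothetical stand-in for the extracted text, with literal "&nbsp;" entities.
raw = 'Красильников&nbsp;Сергей&nbsp;Александрович'

# html.unescape turns "&nbsp;" into the single character U+00A0.
decoded = html.unescape(raw)

# In Python 3, str.split() with no arguments treats U+00A0 as whitespace.
parts = decoded.split()

print(parts)
```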
Upvotes: 7
Reputation: 24921
I ran your code and got:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> page = urllib.urlopen('http://www.rea.ru/Main.aspx?page=Krasil_nikov_Sergejj_Aleksandrovich')
>>> soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
>>> print soup.find('div', {'class': 'flagPageTitle'}).text
Красильников&nbsp;Сергей&nbsp;Александрович
As you can see, the words aren't separated by a regular space but by an HTML space (&nbsp;, the non-breaking space entity). Splitting with .split('&nbsp;') solves your problem:
>>> full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip().split('&nbsp;')
>>> len(full_name)
3
>>> for s in full_name: print s
...
Красильников
Сергей
Александрович
Upvotes: 2
Reputation: 49567
Because your string is separated by &nbsp;, not spaces.
>>> full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip()
>>> full_name
u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432&nbsp;\u0421\u0435\u0440\u0433\u0435\u0439&nbsp;\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447'
>>> full_name.split("&nbsp;")
[u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432', u'\u0421\u0435\u0440\u0433\u0435\u0439', u'\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447']
>>> len(full_name.split("&nbsp;"))
3
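If the text might mix the literal entity, the decoded non-breaking space, and ordinary spaces, a regular expression can split on all of them at once. This is only a sketch in Python 3; the helper name split_name and the sample input are hypothetical:

```python
import re

def split_name(text):
    """Split on runs of literal '&nbsp;' entities and any whitespace."""
    # In Python 3, \s on a str pattern also matches U+00A0.
    pieces = re.split(r'(?:&nbsp;|\s)+', text)
    # Drop empty strings left by leading or trailing separators.
    return [p for p in pieces if p]

# Hypothetical input mixing all three kinds of separator.
sample = 'Красильников&nbsp;Сергей\xa0Александрович '
print(split_name(sample))
```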
Upvotes: 0