LA_

Reputation: 20409

Why can't I split by space?

Here is the string:

u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432 \u0421\u0435\u0440\u0433\u0435\u0439 \u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447'

If I call .split() on it, it doesn't work: only one part is returned. What could be wrong here?

Update: full code:

page = urllib.urlopen('http://www.rea.ru/Main.aspx?page=Krasil_nikov_Sergejj_Aleksandrovich')
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip().split()
self.response.out.write(str(full_name) + '<br>')

Upvotes: 2

Views: 2249

Answers (4)

pcampana

Reputation: 2681

With Python 3, to remove a non-breaking space (&nbsp;):

text = TEXT_WITH_NBSP.replace('\xa0','')
print(text)
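
Note that replacing with an empty string joins the neighbouring words together. If the goal is to split the name, as in the question, a variant that swaps in a regular space may fit better. This is only a sketch: `text_with_nbsp` is an illustrative sample, and it assumes the text contains the actual U+00A0 character rather than the literal text "&nbsp;".

```python
# Illustrative sample: words separated by the real U+00A0 character.
text_with_nbsp = "Красильников\xa0Сергей\xa0Александрович"

# Swap each non-breaking space for a regular space so word boundaries survive.
text = text_with_nbsp.replace("\xa0", " ")

# Splitting on a regular space now yields the three parts of the name.
parts = text.split(" ")
print(parts)
```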

Upvotes: 1

DSM

Reputation: 353159

Ah. See, the key was in information that you didn't post until requested. Your string isn't what it looks like:

[u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432&nbsp;\u0421\u0435\u0440\u0433\u0435\u0439&nbsp;\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447']

where instead of spaces, it's "&nbsp;", the HTML entity for the non-breaking space character. There are several Stack Overflow questions about the best way to remove these; I don't know enough to say which one is best.

[IOW, search for "BeautifulSoup nbsp".]
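
One possible approach, sketched here for Python 3 rather than the Python 2 code in the question, is to decode the entity with the standard library's `html.unescape` and then call `.split()` with no argument. The helper name `split_name` and the sample string are illustrative, not taken from the original code.

```python
import html


def split_name(raw):
    """Decode HTML entities, then split on any Unicode whitespace."""
    # html.unescape turns the literal text '&nbsp;' into the character U+00A0.
    decoded = html.unescape(raw)
    # With no argument, str.split() treats U+00A0 as whitespace in Python 3.
    return decoded.split()


# Illustrative sample mirroring the text returned in the question.
raw_name = "Красильников&nbsp;Сергей&nbsp;Александрович"
parts = split_name(raw_name)
print(len(parts))
```

Because the split runs on whitespace in general, the same helper should also cope with input that mixes regular spaces and non-breaking spaces.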

Upvotes: 7

juliomalegria

Reputation: 24921

I ran your code and got:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> page = urllib.urlopen('http://www.rea.ru/Main.aspx?page=Krasil_nikov_Sergejj_Aleksandrovich')
>>> soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
>>> print soup.find('div', {'class': 'flagPageTitle'}).text
Красильников&nbsp;Сергей&nbsp;Александрович

As you can see, the words aren't separated by a regular space, but by an HTML space entity (&nbsp;, the non-breaking space). Splitting with .split('&nbsp;') solves your problem:

>>> full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip().split('&nbsp;')
>>> len(full_name)
3
>>> for s in full_name: print s
... 
Красильников
Сергей
Александрович

Upvotes: 2

RanRag

Reputation: 49567

Because the words in your string are separated by &nbsp;, not by spaces.

>>> full_name = soup.find('div', {'class': 'flagPageTitle'}).text.strip()
>>> full_name
u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432&nbsp;\u0421\u0435\u0440\u0433\u0435\u0439&nbsp;\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447'

>>> full_name.split("&nbsp;")
[u'\u041a\u0440\u0430\u0441\u0438\u043b\u044c\u043d\u0438\u043a\u043e\u0432', u'\u0421\u0435\u0440\u0433\u0435\u0439', u'\u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440\u043e\u0432\u0438\u0447']
>>> len(full_name.split("&nbsp;"))
3

Upvotes: 0
