Venkateshwaran Selvaraj

Trimming whitespace in Python with bs4

I am trying to remove whitespace from scraped data. I have tried the solutions I could find, but none of them seems to work.

Here is my code:

from bs4 import BeautifulSoup
import urllib2
url="http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.findAll('div',{'class':'field-item odd'})
for eachuniversity in universities:
 #print eachuniversity['href']+","+eachuniversity.string.encode('utf-8').strip()
 print eachuniversity.string if eachuniversity  else ''

The output I am getting is:

                    EMSP
None
None

                    BP J5

                    98880

                    NOUMEA

                    Nouvelle-Calédonie

                    Intra établissement

                    Dr Chantal Barbe

                    [email protected]

                    00 687 25 66 66 (standard)

                    [email protected]

                    1078 (poste Dr Barbe)

                    Accueil stagiaire
None

                    Régional
None

But I want it to be:

EMSP,None,None, BP J5,98880,NOUMEA,Nouvelle-Calédonie,Intra établissement,Dr Chantal Barbe, [email protected], 00 687 25 66 66 (standard), [email protected], 1078 (poste Dr Barbe),  Accueil stagiaire, None, Régional,None

When I tried other Stack Overflow answers, I got a NoneType attribute error.

Update: I have improved my script as follows:

from bs4 import BeautifulSoup
import urllib2
url="http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'field-item odd'}):
 print ''.join(eachuniversity.findAll(text=True)).encode('utf-8').strip()

This gives me the following output:

EMSP
Nom de la structure: 
                    EMASP
Hôpital Gaston Bourret
BP J5
98880
NOUMEA
Nouvelle-Calédonie
Intra établissement
Dr Chantal Barbe
[email protected]
00 687 25 66 66 (standard)
[email protected]
1078 (poste Dr Barbe)
Accueil stagiaire
7h30 17h
Régional
ouverture équipe mobile depuis le 1 aout 2011
Travail au quotidien avec le malade sur demande médecin référent
Activités de formation intra et extra hospitalières sur toute la Nouvelle Calédonie auprès de professionnels de la santé, des auxiliaires de vie, des bénévoles, des prêtres....
Information auprès du grand public
Travail de recherche : étude des problèmes ethniques; évaluation du ressenti des malades walisien et /ou kanak sur l' approche SP  et propositions

But I want all of this on a single line, with the values separated by commas.

Answers (1)

Hari Menon

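To print everything on the same line in Python 2, end the print statement with a trailing comma, which suppresses the newline. Here a ',' string is also printed after each value so that the values are separated by commas: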

print ''.join(eachuniversity.findAll(text=True)).encode('utf-8').strip(),',',
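The trailing comma only works with the Python 2 print statement. As a rough sketch of the same idea in Python 3, the print function takes an end argument that replaces the default newline. The function name write_comma_separated and the sample values are made up for illustration; in the real script the values would come from ''.join(eachuniversity.findAll(text=True)).

```python
import io


def write_comma_separated(values, out):
    """Write each stripped value to `out` on one line, followed by a comma."""
    for value in values:
        # end="," replaces the default newline, which is roughly what the
        # trailing comma of the Python 2 print statement achieves.
        print(value.strip(), end=",", file=out)


# Hypothetical sample values standing in for the scraped text.
buffer = io.StringIO()
write_comma_separated(["\n        EMSP", "\n        BP J5", " 98880 "], buffer)
line = buffer.getvalue()
print(line)
```

Note that this leaves a comma after the last value. If that matters, collecting the values in a list and calling ",".join(...) on it avoids the trailing separator.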

You might also want to remove the newlines inside the text itself, for example with re.sub:

import re

print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8')),',',

This replaces every run of consecutive whitespace characters, including newlines, with a single space.
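Putting the two ideas together, here is a minimal Python 3 sketch that collapses whitespace in each field and joins the fields with commas. It uses only the standard library; the helper names collapse_whitespace and to_single_line and the sample_fields list are assumptions for illustration, and keeping a literal None for missing fields is a guess based on the desired output shown in the question.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace with one space and trim both ends."""
    if text is None:
        # Assumption: missing fields are kept as the literal text "None".
        return "None"
    return re.sub(r"\s+", " ", text).strip()


def to_single_line(fields):
    """Join the cleaned fields into one comma-separated line."""
    return ",".join(collapse_whitespace(field) for field in fields)


# Hypothetical sample data shaped like the scraped text in the question.
sample_fields = [
    "\n                    EMSP",
    None,
    "Nom de la structure: \n                    EMASP\nHôpital Gaston Bourret",
]
single_line = to_single_line(sample_fields)
print(single_line)
```

Because the fields are joined in one call, there is no trailing comma, and the .strip() removes the single leading or trailing space that re.sub alone would leave behind.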
