Reputation: 1785
I am trying to remove whitespace from scraped data. I have tried all the solutions I could find, but nothing seems to work.
Here is my code:
from bs4 import BeautifulSoup
import urllib2
url="http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.findAll('div',{'class':'field-item odd'})
for eachuniversity in universities:
    #print eachuniversity['href']+","+eachuniversity.string.encode('utf-8').strip()
    print eachuniversity.string if eachuniversity else ''
The output I am getting is:
EMSP
None
None
BP J5
98880
NOUMEA
Nouvelle-Calédonie
Intra établissement
Dr Chantal Barbe
[email protected]
00 687 25 66 66 (standard)
[email protected]
1078 (poste Dr Barbe)
Accueil stagiaire
None
Régional
None
But I want it to be:
EMSP,None,None, BP J5,98880,NOUMEA,Nouvelle-Calédonie,Intra établissement,Dr Chantal Barbe, [email protected], 00 687 25 66 66 (standard), [email protected], 1078 (poste Dr Barbe), Accueil stagiaire, None, Régional,None
When I tried other SO answers, I got a NoneType attribute error.
Update: I have improved my script as follows:
from bs4 import BeautifulSoup
import urllib2
url="http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'field-item odd'}):
    print ''.join(eachuniversity.findAll(text=True)).encode('utf-8').strip()
This gives me the following output:
EMSP
Nom de la structure:
EMASP
Hôpital Gaston Bourret
BP J5
98880
NOUMEA
Nouvelle-Calédonie
Intra établissement
Dr Chantal Barbe
[email protected]
00 687 25 66 66 (standard)
[email protected]
1078 (poste Dr Barbe)
Accueil stagiaire
7h30 17h
Régional
ouverture équipe mobile depuis le 1 aout 2011
Travail au quotidien avec le malade sur demande médecin référent
Activités de formation intra et extra hospitalières sur toute la Nouvelle Calédonie auprès de professionnels de la santé, des auxiliaires de vie, des bénévoles, des prêtres....
Information auprès du grand public
Travail de recherche : étude des problèmes ethniques; évaluation du ressenti des malades walisien et /ou kanak sur l' approche SP et propositions
But I want all of this on a single line, separated by commas.
Upvotes: 1
Views: 3175
Reputation: 35445
To print on the same line, just add a trailing comma (,) at the end of the print statement:
print ''.join(eachuniversity.findAll(text=True)).encode('utf-8').strip(),',',
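The trailing comma only works with the Python 2 print statement used above. If you are on Python 3, where print is a function, a rough equivalent is the end argument. The sketch below is only an illustration: print_inline and the sample strings are made-up stand-ins for the text of each scraped div, and it writes to an in-memory buffer so the result can be inspected.

```python
import io

def print_inline(values, out):
    """Print each stripped value followed by a comma, with no newline."""
    for value in values:
        # end=',' replaces the default newline that print() would append
        print(value.strip(), end=',', file=out)

# Hypothetical sample data standing in for the scraped text
buffer = io.StringIO()
print_inline(['EMSP\n', ' BP J5 ', '98880'], buffer)
result = buffer.getvalue()
print(result)
```

Note that this leaves a comma after the last value as well, just like the print statement above does.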
You might also want to remove newlines from the text (this needs the re module):
import re
print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8')),',',
It will replace all consecutive whitespace characters including newlines with a single space.
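Another option is to collect the cleaned strings first and join them once, which avoids the trailing comma. Here is a minimal sketch in Python 3, assuming the text of each div has already been extracted (for example with get_text() in bs4). The names collapse_whitespace, to_csv_line and sample_fields are invented for this example, and the sample strings are not real scraper output.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r'\s+', ' ', text).strip()

def to_csv_line(fields):
    """Clean each field and join them into one comma-separated line."""
    return ','.join(collapse_whitespace(field) for field in fields)

# Hypothetical sample standing in for the text of each scraped div
sample_fields = ['EMSP\n', '  Nom de la structure:\n EMASP ', 'BP J5', '98880\n', 'NOUMEA']
line = to_csv_line(sample_fields)
print(line)
```

If a field can itself contain a comma, the csv module from the standard library is a safer way to write the line, since it quotes such fields for you.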
Upvotes: 1