<div class="members_box_second">
<div class="members_box0">
<p>1</p>
</div>
<div class="members_box1">
<p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
<p class="clear"><b>Designation:</b><span>Proprietor</span></p>
<p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
<p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
<p class="clear"><b>Specification:</b><span>LIFE</span></p>
<p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
</div>
<div class="members_box2">
<p>Ukkadam South</p>
<p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
<p class="clear"><b>Email:</b><span><a href="mailto:[email protected]">[email protected]</a></span></p>
</div>
</div>
<div class="members_box">
<div class="members_box0">
<p>2</p>
</div>
<div class="members_box1">
<p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
<p class="clear"><b>Designation:</b><span>Proprietor</span></p>
<p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
<p class="clear"><b>Specification:</b><span>LIFE</span></p>
<p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
</div>
<div class="members_box2">
<p>Alagar Nivas, 284 NSR Road</p>
<p class="clear"><b>Phone:</b><span>2435674</span></p>
<h4>Factory Address</h4>
Coimbatore - 641 027
<p class="clear"><b>Phone:</b><span>2435674</span></p>
</div>
</div>
I have the above structure. From it, I am trying to scrape the text inside the divs of class members_box1 and members_box2 only.
The following script gets data from members_box1 only:
from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print '\n'
This is how I tried to get data from both boxes:
from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
eachbox2 = soup.findAll('div ', {'class':'members_box2'})
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data = eachbox2 + [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print data
But I get the same result as I do for members_box1 alone.
UPDATE
I want the output to look like this (on a single line) for each iteration:
Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969, "Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861
But I am getting as follows
Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969
"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861
You could use a regex to match either members_box1 or members_box2:
import re
eachbox = soup.findAll('div', {'class':re.compile(r'members_box[12]')})
for eachuniversity in eachbox:
For example,
import bs4 as bs
import urllib2
import re
import csv
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
content = page.read()
soup = bs.BeautifulSoup(content)
with open('/tmp/ccc.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n', )
    eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
    for pair in zip(*[iter(eachbox)]*2):
        writer.writerow([text.strip() for item in pair for text in item.stripped_strings])
Note that you must remove the stray space after div in soup.findAll('div ') in order to find any <div> tags.
The code above uses the very handy grouper idiom:
zip(*[iter(iterable)]*n)
This expression collects n items from iterable and groups them into a tuple, so it lets you iterate over chunks of n items. I've made a poor attempt to explain how the grouper idiom works here.
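To see the grouper idiom in isolation, here is a minimal sketch that runs without fetching the page. It is written for Python 3, and the nested lists are hypothetical stand-ins for the stripped strings of each matched div, assumed to alternate box1, box2, box1, box2 in document order. The helper name grouper is only illustrative.

```python
import csv
import io


def grouper(iterable, n):
    """Group consecutive items into n-tuples; leftover items are dropped."""
    # The list holds n references to the SAME iterator, so zip pulls
    # n consecutive items from it to build each tuple.
    return zip(*[iter(iterable)] * n)


# Hypothetical stand-ins for the text of each matched div, in document order.
boxes = [
    ["Name:", "Mr.A", "Designation:", "Proprietor"],
    ["Street 1", "Phone:", "1111111"],
    ["Name:", "Mr.B", "Designation:", "Partner"],
    ["Street 2", "Phone:", "2222222"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for pair in grouper(boxes, 2):
    # Flatten the (box1, box2) pair into a single CSV row.
    writer.writerow([text for box in pair for text in box])

rows = buffer.getvalue().splitlines()
print(rows)
```

Each member ends up on one row because both of its boxes are flattened before writerow is called. If the divs did not strictly alternate, the pairs would drift out of step, so this relies on the page keeping that layout.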
The problem is that you're adding eachbox2 to each data, instead of to the list of things to loop over.
On top of that, you've got a stray space, 'div ' instead of 'div', that causes eachbox2 to be an empty list.
Try this:
eachbox1 = soup.findAll('div', {'class':'members_box1'})
eachbox2 = soup.findAll('div', {'class':'members_box2'})
for eachuniversity in eachbox1 + eachbox2:
    data = [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
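One thing to keep in mind with this loop is that concatenating the two lists visits every members_box1 first and every members_box2 afterwards. If each member's two boxes should stay together, as in the single-line output the question asks for, pairing the lists is one option. A minimal Python 3 sketch, using placeholder strings where the real code would hold the tags returned by findAll:

```python
# Hypothetical placeholders for the tags that findAll would return.
eachbox1 = ["box1 of member 1", "box1 of member 2"]
eachbox2 = ["box2 of member 1", "box2 of member 2"]

# Concatenation: all members_box1 items come first, then all members_box2.
concatenated = eachbox1 + eachbox2

# zip: the i-th members_box1 is paired with the i-th members_box2,
# assuming both lists have the same length and order.
paired = list(zip(eachbox1, eachbox2))

print(concatenated)
print(paired)
```

This assumes every member has exactly one box of each class; if some member lacked a members_box2, zip would silently misalign the later pairs.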
This isn't really the best way to do things; it's just the simplest fix for your existing approach. BeautifulSoup offers several ways to search for multiple things in one query: for example, you can search on a tuple of values ('members_box1', 'members_box2'), a regexp (re.compile(r'members_box[12]')), or a filter function (lambda c: c in ('members_box1', 'members_box2'))…
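To compare what these three criteria accept, here is a small Python 3 sketch that applies them to plain class-name strings taken from the sample markup, without BeautifulSoup. The function names are illustrative, and the regex is applied with search semantics, which is an assumption about how the library uses the pattern.

```python
import re

wanted = ("members_box1", "members_box2")
pattern = re.compile(r"members_box[12]")


def matches_tuple(css_class):
    # Exact membership in a fixed collection of class names.
    return css_class in wanted


def matches_regex(css_class):
    # Unanchored search: the pattern may match anywhere in the name.
    return pattern.search(css_class) is not None


def matches_function(css_class):
    # The filter-function form; the parentheses around the tuple matter.
    return (lambda c: c in ("members_box1", "members_box2"))(css_class)


classes = ["members_box", "members_box_second", "members_box0",
           "members_box1", "members_box2"]

for name in classes:
    print(name, matches_tuple(name), matches_regex(name), matches_function(name))

selected = [name for name in classes if matches_regex(name)]
```

All three agree on the class names in this page. They differ on names such as members_box10, which the unanchored regex accepts and the exact-match forms reject, so anchoring the pattern (for example with ^ and $) is worth considering if such classes might appear.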