Venkateshwaran Selvaraj

Reputation: 1785

Scraping using Beautiful Soup in Python

<div class="members_box_second">
                    <div class="members_box0">
                        <p>1</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Jagadhesan.S</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>
                        <p class="clear"><b>CODISSIA - Designation:</b><span>(Founder President, CODISSIA)</span></p>
                        <p class="clear"><b>Name of the Industry:</b><span>Govardhana Engineering Industries</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Ukkadam South</p>
                        <p class="clear"><b>Phone:</b><span>2320085, 2320067</span></p>
                        <p class="clear"><b>Email:</b><span><a href="mailto:[email protected]">[email protected]</a></span></p>                       
                    </div>
</div>
<div class="members_box">
                    <div class="members_box0">
                        <p>2</p>
                    </div>
                    <div class="members_box1">
                        <p class="clear"><b>Name:</b><span>Mr.Somasundaram.A</span></p>
                        <p class="clear"><b>Designation:</b><span>Proprietor</span></p>

                        <p class="clear"><b>Name of the Industry:</b><span>Everest Engineering Works</span></p>
                        <p class="clear"><b>Specification:</b><span>LIFE</span></p>
                        <p class="clear"><b>Date of Admission:</b><span>19.12.1969</span></p>
                    </div>
                    <div class="members_box2">
                        <p>Alagar Nivas, 284 NSR Road</p>
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>      
                        <h4>Factory Address</h4>
                        Coimbatore - 641 027
                        <p class="clear"><b>Phone:</b><span>2435674</span></p>
                    </div>
</div>

I have the structure above. From it, I am trying to scrape only the text inside the divs of class members_box1 and members_box2.

I have the following script, which gets data from members_box1 only:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print '\n'

This is how I tried to get data from both boxes:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
eachbox2 = soup.findAll('div ', {'class':'members_box2'})
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  eachbox2 + [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print data

But I get the same result as I do for members_box1 alone.

UPDATE

I want the output to look like this (on a single line) for each iteration:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969, "Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

But I am getting the following:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969
"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

Upvotes: 2

Views: 1233

Answers (2)

unutbu

Reputation: 880757

You could use a regex to match either members_box1 or members_box2:

import re
eachbox = soup.findAll('div', {'class':re.compile(r'members_box[12]')})
for eachuniversity in eachbox:
    pass  # process each matching div here, as in your original loop

For example,

import bs4 as bs
import urllib2
import re
import csv

page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
content = page.read()
soup = bs.BeautifulSoup(content)

with open('/tmp/ccc.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n', )
    eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
    for pair in zip(*[iter(eachbox)]*2):
        writer.writerow([text.strip() for item in pair for text in item.stripped_strings])

Note that you must remove the stray space after div in

soup.findAll('div ')

in order to find any <div> tags.
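To see the effect of the stray space in isolation, here is a minimal sketch on an inline snippet rather than the live page. It assumes Python 3 with bs4 installed and the built-in "html.parser" backend; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one box from the page in the question.
html = '<div class="members_box2"><p>Ukkadam South</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Tag names are compared as exact strings, so "div " (with a trailing
# space) is not the same name as "div".
with_space = soup.find_all("div ", {"class": "members_box2"})
without_space = soup.find_all("div", {"class": "members_box2"})

print(len(with_space), len(without_space))
```

The first search comes back empty, which is why adding eachbox2 to data in the question changed nothing.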


The code above uses the very handy grouper idiom:

zip(*[iter(iterable)]*n)

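This expression collects n consecutive items from iterable and groups them into a tuple, so it lets you iterate over chunks of n items. Here n is 2, which (assuming the two kinds of div alternate in the page) pairs each members_box1 div with the members_box2 div that follows it, so both end up in the same CSV row. I've made a poor attempt to explain how the grouper idiom works here.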
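As a rough illustration of the idiom without any scraping involved, the sketch below chunks a plain list in pairs and flattens each pair into one row. The helper name chunks_of and the sample data are made up for this example; the strings are stand-ins for what stripped_strings would yield from each div.

```python
def chunks_of(items, n):
    """Group items into tuples of n using the zip(*[iter(...)]*n) idiom."""
    # The list holds n references to the SAME iterator, so zip draws
    # n consecutive items for every tuple it builds.
    return list(zip(*[iter(items)] * n))

# Hypothetical stand-ins for alternating members_box1 / members_box2 divs.
boxes = [
    ["Name:", "Mr.Jagadhesan.S"],
    ["Ukkadam South", "Phone:", "2320085, 2320067"],
    ["Name:", "Mr.Somasundaram.A"],
    ["Alagar Nivas, 284 NSR Road", "Phone:", "2435674"],
]

# Flatten each (box1, box2) pair into a single row.
rows = [[text for box in pair for text in box] for pair in chunks_of(boxes, 2)]
for row in rows:
    print(row)
```

One thing to keep in mind: zip stops at the shortest input, so any leftover items that do not fill a complete chunk are silently dropped.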

Upvotes: 3

abarnert

Reputation: 366083

The problem is that you're adding eachbox2 to each data, instead of to the list of things to loop over.

On top of that, you've got a stray space, 'div ' instead of 'div', that causes eachbox2 to be an empty list.

Try this:

eachbox1 = soup.findAll('div', {'class':'members_box1'})
eachbox2 = soup.findAll('div', {'class':'members_box2'})
for eachuniversity in eachbox1 + eachbox2:
    data =  [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]

This isn't really the best way to do things; it's just the simplest fix for your existing approach. BeautifulSoup offers several ways to search for multiple things in one query: for example, you can search based on a tuple of values ('members_box1', 'members_box2'), a regexp (re.compile(r'members_box[12]')), or a filter function (lambda c: c in ('members_box1', 'members_box2')).
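A small sketch of those single-query alternatives, run against an inline snippet instead of the live page. It assumes Python 3 with bs4 and the "html.parser" backend, and it passes a list rather than a tuple because a list is the form the Beautiful Soup documentation describes; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

# Cut-down stand-in for one member entry from the question.
html = """
<div class="members_box0"><p>1</p></div>
<div class="members_box1"><p><b>Name:</b><span>Mr.Jagadhesan.S</span></p></div>
<div class="members_box2"><p>Ukkadam South</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
wanted = ["members_box1", "members_box2"]

# 1. A list of acceptable class values.
by_list = soup.find_all("div", class_=wanted)
# 2. A compiled regular expression, matched against each class value.
by_regex = soup.find_all("div", class_=re.compile(r"members_box[12]"))
# 3. A filter function called with each class value (or None if absent).
by_function = soup.find_all("div", class_=lambda c: c in wanted)

for div in by_list:
    print(div["class"], list(div.stripped_strings))
```

All three searches return the divs in document order, so each members_box1 stays next to its members_box2, unlike the eachbox1 + eachbox2 concatenation above.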

Upvotes: 3
