Scraping using beautiful soup python

Question


                    
1

Name:Mr.Jagadhesan.S
Designation:Proprietor
CODISSIA - Designation:(Founder President, CODISSIA)
Name of the Industry:Govardhana Engineering Industries
Specification:LIFE
Date of Admission:19.12.1969

Ukkadam South
Phone:2320085, 2320067
Email:jagadhesan@infognana.com

2

Name:Mr.Somasundaram.A
Designation:Proprietor
Name of the Industry:Everest Engineering Works
Specification:LIFE
Date of Admission:19.12.1969

Alagar Nivas, 284 NSR Road
Phone:2435674
Factory Address
Coimbatore - 641 027
Phone:2435674

I have the structure above. From it, I am trying to scrape only the text inside the divs with class members_box1 and members_box2.

I have the following script, which gets data only from members_box1:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print '\n'

This is how I tried to get data from both boxes:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
soup = BeautifulSoup(page.read())
eachbox2 = soup.findAll('div ', {'class':'members_box2'})
for eachuniversity in soup.findAll('div',{'class':'members_box1'}):
    data =  eachbox2 + [re.sub('\s+', ' ', text).strip().encode('utf8') for text in eachuniversity.find_all(text=True) if text.strip()]
    print data

But I get the same result as I do for members_box1 alone.

UPDATE

I want the output to look like this (on a single line) for each iteration:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969, "Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

But I am getting the following:

Name:,Mr.Srinivasan.N,Designation:,Proprietor,CODISSIA - Designation:,(Past President, CODISSIA),Name of the Industry:,Arian Soap Manufacturing Co,Specification:,LIFE,Date of Admission:,19.12.1969
"Parijaat" 26/1Shanker Mutt Road, Basavana Gudi,Phone:,2313861

unutbu · Accepted Answer

You could use regex to match either members_box1 or members_box2:

import re
# Match divs whose class is members_box1 or members_box2.
eachbox = soup.findAll('div', {'class':re.compile(r'members_box[12]')})
for eachuniversity in eachbox:
    pass  # process each matched div here
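
As a quick sanity check of what that pattern accepts, it can be run against a few candidate class names with the standard re module alone. This is only a sketch: it assumes Beautiful Soup applies a compiled pattern with search() semantics (worth confirming in the bs4 documentation), and every class name other than members_box1 and members_box2 is made up for illustration. The anchored variant is an alternative, not something the answer itself uses.

```python
import re

# Pattern from the answer: unanchored, so it matches anywhere in the class name.
pattern = re.compile(r'members_box[12]')

# Anchored variant (illustrative alternative, not part of the original answer).
strict = re.compile(r'^members_box[12]$')

# Only the first two names come from the question; the rest are hypothetical.
candidates = ['members_box1', 'members_box2', 'members_box3', 'members_box10']

loose_hits = [name for name in candidates if pattern.search(name)]
strict_hits = [name for name in candidates if strict.search(name)]

print(loose_hits)
print(strict_hits)
```

If the page only ever uses members_box1 and members_box2, both patterns select the same divs; the anchored form matters only if similarly named classes exist.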

For example,

import bs4 as bs
import urllib2
import re
import csv

page = urllib2.urlopen("http://www.codissia.com/member/members-directory/?mode=paging&Keyword=&Type=&pg=1")
content = page.read()
soup = bs.BeautifulSoup(content)

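# Python 2: open the CSV in binary mode so the csv module controls line endings.
with open('/tmp/ccc.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    eachbox = soup.find_all('div', {'class':re.compile(r'members_box[12]')})
    # Take the divs two at a time (box1, box2) and write each pair as one row.
    for pair in zip(*[iter(eachbox)]*2):
        writer.writerow([text.strip() for item in pair for text in item.stripped_strings])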

Note that you must remove the stray space after div in

soup.findAll('div ')

in order to find any <div> tags.

The code above uses the very handy grouper idiom:

zip(*[iter(iterable)]*n)

This expression collects n items from iterable and groups them into a tuple. So this expression allows you to iterate over chunks of n items. I've made a poor attempt to explain how the grouper idiom works here.
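
To make the idiom concrete, here is a minimal sketch that needs no network access or Beautiful Soup. The grouper helper name and the placeholder strings are assumptions standing in for the alternating members_box1 and members_box2 divs; they do not come from the scraped page.

```python
def grouper(iterable, n):
    """Return consecutive n-tuples from iterable; leftover items are dropped."""
    # The list holds n references to the SAME iterator, so zip pulls
    # n successive items from it to build each tuple.
    return zip(*[iter(iterable)] * n)

# Made-up stand-ins for the alternating box1 / box2 divs.
boxes = ['box1-a', 'box2-a', 'box1-b', 'box2-b', 'box1-c']

pairs = list(grouper(boxes, 2))
print(pairs)
```

Because zip stops at the shortest input, the unpaired final item ('box1-c' here) is silently discarded, so a members_box1 without a matching members_box2 would not produce a row.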
