Venkateshwaran Selvaraj

Reputation: 1785

Scraping using BS4 in Python

I am using the following code to scrape data from a website.

from bs4 import BeautifulSoup
import urllib2
import re
for i in xrange(1,461,10):
  try:
    page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
  except urllib2.HTTPError:
    continue
  else:
    pass
  finally:
    soup = BeautifulSoup(page)
    td1=soup.findAll('td', {'class':'comtext'})
    td2 = soup.findAll('td',{'class':'comuser'})
    td3 = soup.findAll('td',{'class':'com'})
    for td1s, td2s, td3s in zip(td1,td2,td3):
      data = [re.sub('\s+', '', text).strip().encode('utf8') for text in td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True)  if text.strip()]
      print ','.join(data)

My output is

A.T.E.EnterprisesPvt.Ltd.,,AnujBhagwati
A.T.E.Pvt.Ltd.,,AtulBhagwati
AalidhraTextileEngineersLtd.,,HansrajGondalia,Mumbai
AarBeeAssociates,Mr.Gopalsamy,022-22872245
ABCarterIndiaPvt.Ltd.,,B.B.Shetty,[email protected]
ABCCorporation,MittalPatel,Mumbai
ABCIndustrialFasteners,S.R.Sheth,022-22872245

But it is supposed to be like this

    A.T.E. Enterprises Pvt. Ltd.,   Anuj Bhagwati   Mumbai  022-22872245    [email protected]    

    A.T.E. Pvt. Ltd.,   Atul Bhagwati   Mumbai  022-22872245    [email protected]    

    Aalidhra Textile Engineers Ltd.,    Hansraj Gondalia    Surat   0261-2279520/30/40  [email protected]    

    Aar Bee Associates  Mr. Gopalsamy   Coimbatore  0422-2236250 / 2238560  [email protected]  

As you can see, the first row's values (Mumbai, 022-22872245, [email protected]) end up in the third, fourth and fifth rows, and the same shift continues for all the rows. I do not know where I went wrong.

Upvotes: 1

Views: 1784

Answers (2)

duhaime

Reputation: 27594

@VooDooNOFX is right. To modify your code accordingly, try something like this:

from bs4 import BeautifulSoup
import urllib2
import re

for i in xrange(1, 461, 10):
  try:
    page = urllib2.urlopen("http://cms.onlinedemos.in/directory.php?click=n&startline={}#lst".format(i))
  except urllib2.HTTPError:
    # skip this page; parsing in a finally block would run even after a failed request
    continue
  soup = BeautifulSoup(page)
  td1 = soup.findAll('td', {'class': 'comtext'})
  td2 = soup.findAll('td', {'class': 'comuser'})
  td345 = soup.findAll('td', {'class': 'com'})
  # each row has three 'com' cells, so split them by column with slicing:
  # s[i:j:k] is the slice of s from i to j with step k
  td3 = td345[0::3]
  td4 = td345[1::3]
  td5 = td345[2::3]
  for td1s, td2s, td3s, td4s, td5s in zip(td1, td2, td3, td4, td5):
    # collect the text nodes of all five cells of this row
    cells = td1s.find_all(text=True) + td2s.find_all(text=True) + td3s.find_all(text=True) + td4s.find_all(text=True) + td5s.find_all(text=True)
    # collapse whitespace runs to a single space and drop commas inside fields
    data = [re.sub(r'\s+', ' ', text).strip().encode('utf8').replace(",", "") for text in cells if text.strip()]
    print ', '.join(data)

Output for the first page:

A.T.E. Enterprises Pvt. Ltd., Anuj Bhagwati, Mumbai, 022-22872245, [email protected]
A.T.E. Pvt. Ltd., Atul Bhagwati, Mumbai, 022-22872245, [email protected]
Aalidhra Textile Engineers Ltd., Hansraj Gondalia, Surat, 0261-2279520/30/40, [email protected]
Aar Bee Associates, Mr. Gopalsamy, Coimbatore, 0422-2236250 / 2238560, [email protected]
AB Carter India Pvt. Ltd., B.B. Shetty, Mumbai, 022-66662961 / 62, [email protected]
ABC Corporation, Mittal Patel, Ahmedabad, 079-40068999 / 26582333, [email protected]
ABC Industrial Fasteners, S.R. Sheth, Mumbai, 022-28470806 / 66923987, [email protected]
Abhishek Enterprises, N.C. Jain, Bhilwara, 01482-264250, [email protected]
Accurate Trans Heat Pvt. Ltd., Kedarmal Dargar, Surat, 0261-2397268, [email protected]
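
The stride slicing used above can be checked in isolation on a plain list. This is only a minimal sketch: the cell values are made-up placeholders standing in for the td elements, and the names cells, cities, phones and emails are illustrative, not taken from the page.

```python
# Flat list of "com" cells in document order: three cells per table row.
# All values here are placeholders, not data from the real page.
cells = [
    "City 1", "Phone 1", "Email 1",
    "City 2", "Phone 2", "Email 2",
    "City 3", "Phone 3", "Email 3",
]

# s[i::3] takes every third item starting at offset i,
# so each slice picks out one column of the table.
cities = cells[0::3]
phones = cells[1::3]
emails = cells[2::3]

# Zipping the three columns gives one tuple per row again.
rows = list(zip(cities, phones, emails))
for row in rows:
    print(", ".join(row))
```

This only works if every row really has exactly three such cells; a row with a missing cell would shift everything after it.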

Upvotes: 1

VooDooNOFX

Reputation: 4762

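Taking a look at the HTML of this page, there are 3 columns of class com for every row. zip pairs items by position and stops at the shortest list, so zipping a list of 10 items with another list of 10 items and a third list of 30 items matches each row with a single com cell instead of its three. That results in the type of output you're getting.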

>>> len(td3)
30
>>> td3[0:3]
[<td class="com" width="100"></td>, <td class="com" width="160"></td>, <td class="com" width="185"></td>]
>>> td3[3:6]
[<td class="com" width="100">Mumbai</td>, <td class="com" width="160">022-22872245</td>, <td class="com" width="185">[email protected]</td>]
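
The mismatch can be reproduced without the page at all. Below is a rough sketch with placeholder lists (names, users and com are hypothetical stand-ins for the three findAll results), followed by one possible way to realign them by grouping the flat list into chunks of three.

```python
# Placeholder data: 3 rows, but 3 "com" cells per row (9 in total).
names = ["Company A", "Company B", "Company C"]
users = ["User A", "User B", "User C"]
com = [
    "City A", "Phone A", "Email A",
    "City B", "Phone B", "Email B",
    "City C", "Phone C", "Email C",
]

# zip() pairs items by position and stops at the shortest input,
# so row i gets com[i] rather than the three cells that belong to it.
misaligned = list(zip(names, users, com))

# Grouping the flat list into chunks of three gives one chunk per row.
chunks = [com[i:i + 3] for i in range(0, len(com), 3)]
aligned = [[n, u] + c for n, u, c in zip(names, users, chunks)]

for row in aligned:
    print(", ".join(row))
```

In misaligned, the second and third rows receive "Phone A" and "Email A", which is the same kind of shift as in the question's output.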

Upvotes: 2
